TRIAL AT YALE UNIVERSITY OF THE ARMED
FORCES INSTITUTE GENERAL EDUCATIONAL
DEVELOPMENT TESTS


ALBERT B. CRAWFORD 
AND 
PAUL S. BURNHAM 
Yale University 

Among several major problems relative to demobilization
with which our colleges are immediately confronted is that of 
evaluating, fairly and with reasonable assurance, the scholastic 
promise of individuals now serving the Armed Forces. Edu- 
cational provisions of the “G.I. Bill” in due course undoubtedly 
will bring to the colleges a flood of applicants for admission. 
Among these will be many “irregulars”: those with broken or
deferred educational histories in the formal sense, yet with valu- 
able training or new skills acquired in service. Since these 
personal developments will be difficult, if not impossible, to 
assay in traditional coinage of the academic world (“units” 
and “credits”), colleges at last may be impelled to consider 
what the prospective student knows, and what educational 
aptitudes he can demonstrate, regardless of where or how these 
attributes have been obtained. 

The Armed Forces Institute has made educational programs 
of extensive scope available throughout all major theatres of 
training and operation. Certificates attesting satisfactory 
completion of particular courses, or achievement-test scores in 
specific fields, are issued through “G.H.Q.” for this purpose, at 
the University of Wisconsin, Madison. However, the Armed 
Forces Institute measure most useful to colleges in their selec- 
tion of future, demobilized students appears to be the General 
Educational Development Battery of four tests, viz: 
Test 1) Correctness and Effectiveness of Expression

Test 2) Interpretation of Reading Materials in the 
Social Sciences 

Test 3) Interpretation of Reading Materials in the 
Natural Sciences 

Test 4) Interpretation of Literary Materials. 


No further discussion of these tests is necessary here, since 
they have been widely publicized and are fully described in the 
Examiners’ Manual issued by the American Council on Education.¹
This Manual contains valuable remarks concerning
the nature and purposes of the General Educational Develop- 
ment Battery, and presents tentative college norms for each 
test. It stresses the need for local norms as well, stating: 

These norms are intended to help the schools decide what mini- 
mum test performance should entitle returning war-service persons to 
a given amount of academic credit in a given area. Because of the 
heterogeneity just noted, however, there is danger that certain schools, 
by uncritical acceptance of the general norms here reported, may set 
standards inconsistent with their local needs. It is highly desirable, 
therefore, that each institution, by administering the civilian forms to 
its own civilian students establish its own local norms for use along 
with the general norms in the interpretation of scores of service men 
and women reported to the institution by the United States Armed 
Forces Institute. 


To that end, particularly as it might affect the program of 
Yale Studies for Returning Service Men, an experiment was 
recently conducted at Yale University. All members of the 
civilian freshman class which had matriculated in July 1944 
were invited to participate; 135, or about one-third of the total, 
did so. Although the invitation put no pressure upon fresh- 
men—since only interested volunteers were desired—it stated: 
a) that participation would be of direct service to the Univer- 
sity and collaterally to men now in the Armed Forces; b) that 
reasonable compensation would be given for the time expended. 

1 For further information on this Battery and other phases of the Armed Forces
Institute, the so-called “Fox-hole University,” see: The United States Armed Forces
Institute, Tests of General Educational Development (College Level), Examiners’
Manual, American Council on Education, 1944; and Guide to the Evaluation of
Educational Experiences in the Armed Services, American Council on Education,
Washington, D. C., 1944.
For each of these four tests comprising the A.F.I. General Educational
Development Battery,² two hours are allowed.

In appraising their performance, it is essential to note how 
well the test group represented, in scholastic promise and 
achievement, the entire class. Fortunately it was found that 
the random sample of participants in this experiment covered 
virtually the entire class range, from highest to lowest, in aca- 
demic standing, for the first complete term of Freshman Year. 
Average score of this group on the College Entrance Examination
Board’s Scholastic Aptitude Test was 575 and on other
tests of the College Board or the Yale Educational Aptitude
Battery of the same order, around .20 standard deviation
above the class mean. Hence these voluntary participants
ranked, in pre-matriculation indices and later scholastic
achievement, somewhat above the median of Yale freshmen;
i.e., around percentile 60 rather than 50 in academic promise
and accomplishment. Moreover, despite relatively high average
performance of this freshman group on the A.F.I. tests,
dispersion of their scores followed a normal-probability pattern.

TABLE 1
AFI Test Group Compared with Total Freshman Class of 1947

                                         Mean    S.D.
CEEB Verbal Scores
  a) Class of 1947 (N=433)                555     98
  b) AFI Test Group (N=135)               575    101
CEEB Mathematical Scores
  a) Class of 1947 (N=430)                583     80
  b) AFI Test Group (N=135)               595     84
General Scholastic Prediction
  a) Class of 1947 (N=427)*              73.3    7.4
  b) AFI Test Group (N=135)              74.6    7.1
Freshman First Term Average
  a) Class of 1947 (N=411)*              75.3    8.4
  b) AFI Test Group (N=135)              76.5    7.8

* Irregularities in secondary-school record obviated scholastic predictions in some
cases; elimination, chiefly for military service, among freshman matriculants naturally
accounts for the difference between 433 and 411 in total class numbers represented.
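
The comparison with the class can be checked by simple arithmetic. The sketch below uses only the means and standard deviations reported in Table 1; the function names are illustrative, and the conversion to a percentile assumes a normal distribution of scores within the class, which the article asserts only for the A.F.I. scores.

```python
from statistics import NormalDist

# (class mean, test-group mean, class S.D.) as reported in Table 1.
TABLE_1 = {
    "CEEB Verbal": (555, 575, 98),
    "CEEB Mathematical": (583, 595, 80),
    "General Scholastic Prediction": (73.3, 74.6, 7.4),
    "Freshman First Term Average": (75.3, 76.5, 8.4),
}


def standardized_gap(class_mean, group_mean, class_sd):
    """Difference between group and class means, in class S.D. units."""
    return (group_mean - class_mean) / class_sd


def percentile_of_group_mean(z):
    """Class percentile at which the group mean falls, assuming normality."""
    return 100 * NormalDist().cdf(z)


gaps = {name: standardized_gap(*row) for name, row in TABLE_1.items()}
for name, z in gaps.items():
    print(f"{name}: gap = {z:.2f} S.D., percentile = {percentile_of_group_mean(z):.0f}")
```

On the verbal test the gap is about .20 standard deviation, which under the normality assumption places the group mean a little below the 60th percentile of the class.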


2 The Battery employed is that obtainable through the Cooperative Test Service,
15 Amsterdam Avenue, New York City; it is a parallel form to that actually used by the
Armed Forces Institute.
Time Allowances


Maximum time allowed for each test is two hours. As 
indicated in Table 2, many Yale students finished in consider- 
ably less time. The low correlations there shown also indicate 
that length of time spent on each test bore little relationship 
to score made by these freshmen. The time allowances are 
generous and the tests (as intended) therefore seem to depend 
on “power” rather than on speed. 


TABLE 2
Correlation of AFI Test Scores with Testing Time for Yale Freshman
Test Group (N=135)

                                     Average time    r with time
AFI Test I (Expression)                 65 min.         -.12
AFI Test IV (Literature)                75 min.         5 (partly illegible)
AFI Test II (Social Studies)            80 min.         -.01
AFI Test III (Natural Sciences)         93 min.          .04
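
The coefficients in Table 2 are product-moment correlations between minutes spent and score earned. The sketch below shows how such a coefficient is computed; the six pairs of values are hypothetical and are not data from the Yale experiment.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical testing times (minutes) and standard scores for six examinees.
minutes = [55, 60, 65, 70, 75, 80]
scores = [62, 58, 66, 61, 64, 60]

r_time_score = pearson_r(minutes, scores)
print(f"r between time and score = {r_time_score:.2f}")
```

A coefficient near zero, as in this made-up example and in Table 2, means that knowing how long an examinee worked tells almost nothing about his score.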





Intercorrelations among the Armed Forces Institute Tests 


Since all four tests of this battery, including that designated 
as “Interpretation of Reading Materials in the Natural Sci- 
ences” are highly verbal, a considerable degree of positive inter- 
correlation among them would be expected. The Examiners’ 
Manual rather surprisingly offers no data in this respect. The 
following table presents these, for the Yale freshman test group. 
The correlations in this and subsequent tables were obtained 
from standard scores for the AFI tests, derived from raw scores 
by use of the conversion tables furnished in the Examiners’
Manual.

TABLE 3
Intercorrelations among AFI Tests (N=135)

                              IV             II               III
Tests                     (Literature)   (Social Studies)  (Natural Science)
I (Expression)                .57            .47              .48
IV (Literature)               ..             .63              .52
II (Social Studies)           ..             ..               .57





This correlation matrix indicates rather clearly (on the 
basis of a small but representative Yale sample) that the 
Armed Forces Institute General Educational Development Battery
is general, rather than specific or differential, throughout
its several parts. We hasten to add that no derogation is im- 
plied by the foregoing statement. It is simply not in the na- 
ture of this Battery, nor in the aims of its distinguished con- 
structors, to measure differential rather than general scholastic 
aptitudes; which indeed should be clear from its title and from 
objectives discussed in the Examiners’ Manual. Nevertheless, 
these coefficients are not so high as to suggest that all four tests 
measure the same mental functions. It may be that similarity 
in form (i.e., the fact that all four are highly verbal) accounts 
for the intercorrelations, and that dissimilarity in content 
accounts for their not being even higher. 
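
The two halves of this argument can be put in numbers. The sketch below takes the six coefficients of Table 3, averages them, and squares each to give the proportion of variance two tests share. The simple arithmetic mean is used for illustration only; it is not a procedure taken from the article.

```python
# Intercorrelations from Table 3 (Yale freshman test group, N = 135).
TABLE_3 = {
    ("I", "IV"): 0.57,
    ("I", "II"): 0.47,
    ("I", "III"): 0.48,
    ("IV", "II"): 0.63,
    ("IV", "III"): 0.52,
    ("II", "III"): 0.57,
}

# Average level of relationship among the four tests.
mean_r = sum(TABLE_3.values()) / len(TABLE_3)

# Squared correlation: proportion of variance common to each pair of tests.
shared_variance = {pair: r ** 2 for pair, r in TABLE_3.items()}

print(f"mean intercorrelation = {mean_r:.2f}")
for pair, share in shared_variance.items():
    print(f"tests {pair[0]} and {pair[1]}: shared variance = {share:.2f}")
```

The average is substantial, which supports the description of the Battery as general, yet no pair of tests shares as much as 40 per cent of its variance, which is consistent with the statement that the four tests do not measure identical functions.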


Comparison of Armed Forces Institute with College 
Entrance Examination Board Tests 


Although the content and objectives of A.F.I. General Edu- 
cational Development Tests differ considerably from those of 
College Entrance Examination Board measures, certain ele- 
ments of the two series, from Yale evidence, appear functionally 
to have much in common. A chief difference of objectives 
concerns recency of formal schooling; i.e., College Board ex- 
aminations are meant to be taken in stride by students nearing 
completion of their college-preparatory work and therefore to 
a substantial degree quite properly measure current, specific 
achievements. The A.F.I. educational development battery, 
no less properly, attempts to measure combined aptitude and 
achievement (however acquired) in more general terms. Yet 
the following table indicates a rather high degree of corre- 
spondence among those parts of each test-series which by title 
seemingly represent analogous educational fields. The Yale 
experiment affords direct comparison in this respect of the two 
testing methods. Candidates for admission to Yale are nor- 
mally required to take the College Board Scholastic Aptitude 
Test (both verbal and mathematical sections) and the English 
Essay Examination. They distribute themselves rather widely 
among the other College Board options, according to subjects 
of study in the senior-high-school year and intended college 
program. Consequently some of the correlations represented
below, and others in Table 5, are based upon too small a number
of cases to warrant general conclusions. By way of comparison,
first-term versus second-term grades in the Freshman
Year (Class of 1945W) correlated: English 10, .61; History 10, 
.73; Chemistry 11 and 14, .57; Physics 10, .76. 

Despite the numerical limitations of these data in some
categories, it appears evident that the Armed Forces Institute
tests correlate surprisingly well with those of the College
Entrance Examination Board in related areas, especially when
allowance is made for the high average performance and therefore
somewhat reduced “spread” of Yale freshmen on the A.F.I.
Battery.

TABLE 4
Correlations between CEEB and AFI Tests

                                       AFI Tests
                            I           IV          II          III
CEEB Tests          N   (Expression) (Literature) (Social     (Natural
                                                   Studies)    Science)
SAT (Verbal)       135     .63         .74         .64         .56
English Essay      135     .32         .30         ..          ..
Social Studies      34     ..          ..          .71         ..
Chemistry           54     ..          ..          ..          .77
Physics             58     ..          ..          ..          .42

Note: Average of all CEEB Tests correlated .72 with AFI Total score (N = 135);
for the means and standard deviations associated with the data of Table 4, the reader
is referred to Table 4a at the end of this article.
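
The allowance for reduced spread is not quantified in the article. One common way of making such an allowance is the standard correction of a correlation for restriction of range on the selected variable, sketched below. The choice of this formula is an assumption on our part, not the authors' procedure, and the ratio of standard deviations used in the example is hypothetical, since the unrestricted standard deviation is not reported here.

```python
from math import sqrt


def correct_for_range_restriction(r, sd_ratio):
    """Estimate the correlation in an unrestricted group.

    r        -- correlation observed in the restricted (selected) group
    sd_ratio -- unrestricted S.D. divided by restricted S.D. on the test
    """
    return r * sd_ratio / sqrt(1 + r * r * (sd_ratio ** 2 - 1))


# .72 is the correlation of the CEEB average with AFI Total score (Table 4 note).
# The ratio 1.25 is a hypothetical value chosen only to illustrate the effect.
observed_r = 0.72
estimated_r = correct_for_range_restriction(observed_r, sd_ratio=1.25)
print(f"observed r = {observed_r:.2f}, estimated unrestricted r = {estimated_r:.2f}")
```

When the ratio is 1, the function returns the observed coefficient unchanged; any ratio above 1 raises the estimate, which is the direction of the allowance described in the text.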


Relationship of Scores on A.F.I. General Educational Develop- 
ment Battery and Certain College Board Tests 
to First-Term Grades at Yale 


A pragmatic and customary, though by no means ideal, 
method of evaluating tests is to correlate scores thereon with 
subsequent grades in course. The latter, however carefully 
assigned, because of their subjective nature have dubious sta- 
tistical reliability. However, in many situations they offer 
the fairest criterion by which the relative value of different 
prognostic measures may be compared. The next exhibit 
(Table 5) presents a series of correlations between test scores
and first-term grades of participants in this experiment.

TABLE 5
Correlation of CEEB and AFI Tests with Freshman-Year,
First-Term Grades
(Data based on AFI Test Group of 135 Freshmen)

Course                  Test                              r                      N
English 10a             AFI I (Expression)               .50                    100
                        AFI IV (Literature)              .54                    100
                        CEEB (test name illegible)       .53                    100
                        CEEB English Essay               .36                     99
History (All Courses)   AFI II (Social Studies)          .52                     43
                        CEEB (test name illegible)       .43                     43
Physics 11-12a          AFI III (Natural Science)        .48                     44
                        CEEB (test name illegible)       .58                     28
Mathematics 12a         AFI I (Expression)               .46                     78
                        AFI IV (Literature)              .30                     78
                        AFI II (Social Studies)          7 (partly illegible)    78
                        AFI III (Natural Science)        .30                     78
                        CEEB (test name illegible)       .59                     78
Engr. Drawing 10a       CEEB (test name illegible)       .62                     35
Freshman First-         AFI I (Expression)               .51                    135
Term Average            AFI IV (Literature)              .41                    135
                        AFI II (Social Studies)          .50                    135
                        AFI III (Natural Science)        .36                    135
                        AFI Total Score                  .56                    135
                        CEEB SAT (Verbal)                .41                    135
                        *CEEB Verbal Average             .40                    135
                        †CEEB General Average            .44                    135
                        Average of all CEEB tests        .53                    135

* Average of CEEB Language, Social Studies and English Essay tests.

† Average of all College Board achievement tests.

Note: For the means and standard deviations associated with the data of Table
5, the reader is referred to Table 5a at the end of this article.

It is recognized that several factors bearing upon comparison
of the College Entrance Examination Board tests and
U.S.A.F.I. General Educational Development Tests in Tables
4 and 5 may affect the results there given. All 135 men took
the same A.F.I. tests but not all took the same College
Board tests. They all did take three of the latter: Scholastic
Aptitude Test (Verbal), Mathematical Aptitude and English
Essay. The remaining two College Board tests taken by each
candidate for admission are elections determined by his secondary
school course and prospective college major and are drawn
from the following options: Social Studies; French, German,
Spanish, or Latin Reading; Biology, Chemistry, or Physics;
Spatial Relations (Three-Dimensional) Visualizing. Consequently
the average on College Board Examinations is not
based upon an identical battery for all men represented in this
study. Each candidate’s options, however, are appropriate to
his expected freshman program.

Most of the students represented took their College Board 
tests in April, 1944, whereas the A.F.I. battery was admin- 
istered near the end of the first term. Consequently, the in- 
terval between taking College Board tests and obtaining the 
final grades for the term (six months) was considerably greater 
than that between A.F.I. tests and the end of term (one 
month). There is no way of estimating to what extent these
differing time intervals may have affected correlations of the
respective test scores with the criteria.

Again the number of cases upon which some of these cor- 
relations are based permits only tentative conclusions. It does 
appear, however, that: 

1) A.F.I. tests show promise of being wholly acceptable 
alternates for College Board examinations in the verbal sub- 
jects. 

2) For this sample group, A.F.I. Total Score correlated as 
well with Freshman first-term average in all courses, as did the 
average of all College Board tests. 

3) The A.F.I. General Educational Development Battery 
makes no pretense of measuring abilities in Mathematics, Me- 
chanical Drawing, or Descriptive Geometry. The College 
Board M.A.T. and Spatial Relations tests are probably indis- 
pensable for prospective scientific or engineering majors. We 
have obtained no information regarding A.F.I. tests in specific 
branches of mathematics, but these would have pertinence 
chiefly for students who have completed those particular 
courses. 

4) Since College Board M.A.T. scores are likewise some- 
what dependent upon previous mathematical training, inter- 
ruption thereof might adversely affect individual performance 
on this test also. By its nature, the Spatial Relations score is 
less likely to be so affected. 
From joint inspection of the A.F.I. test scores and freshman
grade distributions, it was a relatively simple matter to de- 
termine two critical-score levels. Practically no students scor- 
ing above the upper one had unsatisfactory first-term records, 
while most of those scoring below the other ranked well under 
the class average. As a result of this investigation it was voted 
by the Executive Committee of Yale Studies for Returning Ser- 
vice Men to admit candidates scoring above the higher critical 
level on the A.F.I. Battery despite their deficiency in formal 
academic credits; to reject or discourage (unless additional evidence
more favorable to their chances of success is submitted)
those scoring below the lower critical level; and to examine by 
other means border-line cases falling between these upper and 
lower levels. 
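
The Committee's rule amounts to a three-way decision on the A.F.I. score. The sketch below is one possible reading of it. The two critical levels are hypothetical placeholders, since the article does not report the values adopted, and the treatment of a low-scoring candidate who submits favorable additional evidence (referral to further examination) is likewise an assumption.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit despite deficiency in formal credits"
    EXAMINE = "examine by other means"
    REJECT = "reject or discourage"


def screen_candidate(afi_total, lower, upper, favorable_evidence=False):
    """Apply two critical-score levels to an A.F.I. Battery score."""
    if lower > upper:
        raise ValueError("lower critical level must not exceed the upper one")
    if afi_total > upper:
        return Decision.ADMIT
    if afi_total < lower:
        # Assumption: favorable extra evidence moves the case to further examination.
        return Decision.EXAMINE if favorable_evidence else Decision.REJECT
    # Border-line cases between (or exactly at) the two levels.
    return Decision.EXAMINE


# Hypothetical critical levels, for illustration only.
LOWER, UPPER = 250, 290

for score in (300, 270, 240):
    print(score, screen_candidate(score, LOWER, UPPER).value)
```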

For the reasons noted above in Conclusion (3) further tests 
(of the aptitude rather than formal achievement type) will be 
required of prospective engineering, mathematics or physical 
science majors. 


TABLE 4a
Means and Standard Deviations of Table 4 Data

Variables correlated                              N     Mean    S.D.
a) CEEB SAT and English Essay with AFI tests
     CEEB SAT                                    135    57.5    10.1
     CEEB English Essay                          135    55.5     8.5
     AFI I                                       135    62.9     6.2
     AFI IV                                      135    64.7     6.1
     AFI II                                      135    69.5     7.7
     AFI III                                     135    73.3     5.5
b) CEEB Social Studies and AFI II
     CEEB Social Studies                          34    59.6     9.2
     AFI II                                       34    69.3     8.1
c) CEEB Chemistry and AFI III
     CEEB Chemistry                               54    56.2     8.8
     AFI III                                      54    72.8     6.1
d) CEEB Physics and AFI III
     CEEB Physics                                 58    58.5     7.9
     AFI III                                      58    74.3     4.4
e) CEEB average and AFI total score
     Average of all CEEB tests                   135    57.4     6.3
     AFI total score                             135   270.4    20.9











270 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 















TABLE 5a
Means and Standard Deviations of Table 5 Data

Variables correlated                              N     Mean    S.D.
a) English Grades, AFI and CEEB tests
     English Grades (10a)                        100    74.4     7.7
     AFI I                                       100    62.0     5.9
     AFI IV                                      100    63.8     5.9
     CEEB (test name illegible)                  100    55.0     8.8
     English Grades (10a)                         99    74.4     7.7
     CEEB English Essay                           99    54.3     8.5
b) History Grades, AFI and CEEB tests
     History Grades (all courses)                 43    77.2     8.2
     AFI II                                       43    69.4     7.6
     CEEB (test name illegible)                   43    55.8     9.9
c) Physics Grades, AFI and CEEB tests
     Physics Grades (11-12a)                      44    78.4    12.2
     AFI III                                      44    74.7     4.6
     Physics Grades (11-12a)                      28    78.6    11.5
     CEEB (test name illegible)                   28    60.4     9.1
d) Mathematics Grades, AFI and CEEB tests
     Mathematics Grades (12a)                     78    73.7    13.3
     AFI I                                        78    63.0     6.2
     AFI IV                                       78    64.4     6.4
     AFI II                                       78    69.1     7.9
     AFI III                                      78    74.1     5.1
     CEEB (test name illegible)                   78    60.0     9.4
e) Engineering Drawing Grades and CEEB scores
     Engineering Drawing Grades (10a)             35    81.5     7.8
     CEEB (test name illegible)                   35    57.5     9.5
f) Freshman Averages and CEEB tests
     Freshman Average (First Term)               135    76.5     7.8
     AFI I                                       135    62.9     6.2
     AFI IV                                      135    64.7     6.1
     AFI II                                      135    69.5     7.7
     AFI III                                     135    73.3     5.5
     AFI Total Score                             135   270.4    20.9
     CEEB SAT                                    135    57.5    10.1
     CEEB Verbal Average                         135    55.4     7.4
     CEEB General Average                        135    56.2    (illegible)
     Average of all CEEB tests                   135    57.4     6.3

















THE CRITERION 


HERBERT A. TOOPS 
The Ohio State University 


In all test work, and in making predictions generally, there 
must be—according to the current mode of the thought and 
its consequent development into corresponding formulas—a 
unitary, general success score, or criterion score, for each person 
of the experimental group by whose aid the tests are con- 
structed, combined, and validated. 

Item analysis for selecting the better items requires such a 
criterion. So does item alternative analysis, done with the 
purpose of altering, in the hope of improving, those confusion 
alternatives which are “out of line,” which have, for example, 
positive item-alternative validity coefficients instead of the 
much-to-be-desired negative ones. For a multiple-choice test 
we desire such alternatives—for right answer and confusion 
alternatives—that the validity of each item may be as high as 
possible; or, to put it in another fashion, that as many as possi- 
ble of the experimental items shall have a validity coefficient 
above some arbitrary lower limit, say .24 for tests of intelligence 
at the university freshmen level. In the former case, which 
presupposes the second having already been satisfactorily ac- 
complished, we are concerned with the picking of the individual 
items such that, for the n items chosen for a given test, the
correlation of the sum of the scores (the Rights score) on the
items-as-a-whole (i.e., the test as finally constituted) and the
criterion shall be a maximum. If the intercorrelations of the
items are low, approximating zero, we may be content to ignore
them; but we can hardly proceed at all without validity
coefficients, or some reasonably adequate substitute therefor, the
indices of which of necessity also require a criterion for their 
computation. 
Having arrived at n “pretty good” items, each of desirable
alternatives, we may hope, if the test is a time-limit test, to 
improve further the validity of each sub-test by utilizing scoring
formulas, which, if we are to use them, “must discount
errors” (i.e., C must be negative in S = R + C·W). This too
presupposes a criterion because the weight of Errors (E), or of
Wrongs (W), i.e., errors (E) plus omissions (O), relative to
Rights, can be ascertained (as vs. judged arbitrarily) only by
the aid of a criterion and the multiple regression equation. The
selection of tests for a battery presupposes that considerable 
numbers of sub-tests are tried out against the criterion as a sort 
of ultimate measuring rod, and that only “the few best” are 
chosen. 
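
The scoring formula S = R + C·W can be written out directly. In the sketch below the value of C is supplied by the caller. The example value of -1/3 is the conventional correction for chance on four-choice items and is used only for illustration; as the text says, the weight is properly ascertained from a criterion by regression.

```python
def formula_score(rights, wrongs, c):
    """Score S = R + C * W, where C is the weight given to each wrong answer."""
    return rights + c * wrongs


# Illustrative values only: 40 rights, 9 wrongs, C = -1/3.
example_score = formula_score(rights=40, wrongs=9, c=-1 / 3)
print(f"S = {example_score:.1f}")

# With a negative C, errors are discounted: the score falls below the Rights score.
assert example_score < 40
```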

To weight these chosen tests in a multiple regression equa- 
tion then requires also a criterion. The purpose here is to so 
weight the several sub-tests that a maximum validity of the 
composite or “scale score” results. Even if the weights are arbi- 
trarily chosen a criterion is needed in order to compute a 
validity coefficient of the thus arbitrarily weighted scale. 

Finally, then, we may say that we require a criterion for the 
general purpose of maximizing or optimalizing all sorts of rela- 
tionships in the problem of prediction; for determining the poor 
alternatives and choosing the better items; for determining the 
best scoring formula; for selecting the better sub-tests; for de- 
ciding the weights to be ascribed to each in a battery or “scale”; 
for determining the validity coefficients of batteries or scales 
of tests weighted either arbitrarily or by multiple regression 
equations; for determining the “most causal” of all the predic- 
tive variables (inclusive of traits and environmental factors or 
variables and social relationships) and determining their rela- 
tive weights; and for determining what basic form of relation- 
ship (including therein “higher forms of the multiple regression 
formula”) underlies the best combination of all the “causal” 
variables. For all such purposes a unitary criterion score for 
each person of our experimental population is indispensable. 

Yet success is not unitary! A concrete consideration of 
cases will emphasize the point. Miss A is, let us say, +1σ in
speed-of-typing, but is -1σ in accuracy-of-typing (that is, she
“is speedy but inaccurate”). Miss B is -1σ in speed-of-typing
but is +1σ in accuracy-of-typing (and popularly, “is slow but
accurate”). If for each of these persons we add the scores, or
average the scores, or (for these particular two persons) weight
the scores equally, in all three circumstances they turn out to
be 0σ on the composite-of-speed-and-accuracy variable, or
success-as-a-typist variable; that is to say, both are “mediocre.”
Yet by no reasonable stretch of the imagination are Miss A and 
Miss B equally “successful” typists. The more true statement 
is to say that they have a different kind, or a different type or 
a different pattern of success. Most employers of stenogra- 
phers will prefer for most purposes Miss B, because they hate 
errors, they hate the correction of mistakes and the complaints 
of customers and clients which inevitably result when errors are 
made. And besides, even if Miss B does less work, still very 
little of it has to be done over again. And doing work over 
takes much time (much more than enough to do it correctly
in the first place) and costs money.¹
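
The arithmetic of the two typists can be laid out as follows. The standard scores are those given in the text; the unequal weights in the second half are hypothetical and stand for an employer who values accuracy above speed.

```python
# Standard scores (in sigma units) as given in the text.
profiles = {
    "Miss A": {"speed": 1.0, "accuracy": -1.0},
    "Miss B": {"speed": -1.0, "accuracy": 1.0},
}


def composite(profile, weights):
    """Weighted sum of a person's standard scores."""
    return sum(weights[trait] * z for trait, z in profile.items())


equal_weights = {"speed": 0.5, "accuracy": 0.5}
accuracy_first = {"speed": 0.25, "accuracy": 0.75}  # hypothetical employer preference

equal_result = {name: composite(p, equal_weights) for name, p in profiles.items()}
weighted_result = {name: composite(p, accuracy_first) for name, p in profiles.items()}

print("equal weights:", equal_result)
print("accuracy weighted more heavily:", weighted_result)
```

Under equal weights both typists receive the same composite of zero, although their profiles are opposite; only when the weights reflect the employer's preference does the composite separate them.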

Miss A and Miss B do have a different profile of typing suc- 
cess. This suggests that, possibly, if we could hold them to the 
same speed of production (that is, keep speed a constant) then 
their relative errors might measure their relative success; or 
conversely, if we could experimentally keep their errors a con- 
stant their resultant speed would yield us a unitary measure of 
their success; that is, that in either of the two alternatives we 
should have not two aspects of success but rather only one; 
and instead of having for each a two-traited profile we would 
have a unitary success score such as our ordinary regression 
and weighting equations demand. 

A little thought will reveal the fact that this is not possible, 
psychologically, for even these two variables; for hold down the 
speed of Miss A and instead of becoming more accurate (as 
almost everyone presupposes will be the result) the converse 
actually may occur; while speed up Miss B and her accuracy 
in all probability will improve; at least the normal expectation
is that, after a minimum of practice, such will occur. It all
depends!

1 An error, popularly, costs $X to produce it, $X to undo it (e.g., to erase it)
and $X to do it over again, not to mention $X to locate it (inspection costs).
Needless to say, the four X’s are not all equal. But this breakdown of the sources of the
costs of errors gives some light on why errors are so abhorred by employers.

Also, speed and accuracy are not the only variables in suc- 
cess. Success is not a two-traited variable compounded only 
of speed and accuracy. If Miss A and Miss B are private secre- 
taries their employer considers as aspects of their merit or worth 
other abilities as well: how well, for example, they can spell, to 
cover up his ineptness in this respect; the excellence of their 
grammar, so that they never say in a letter “have went” even 
though that may be what he dictated into the dictating ma- 
chine; their ability and adroitness in answering the questions 
of visitors and in answering the telephone; and even their 
ability pleasantly and skillfully to entertain an influential cus- 
tomer until the boss’s return. In other words, even in simple 
jobs success is multi-dimensional. Or, finally, we conclude that 
success on even a simple job is measured by the individual’s 
profile on a multi-dimensional profile system composed of m 
criterion variables, where, generally speaking, m is greater 
than 1. 

The upshot of all the argument to date, then, would seem 
to be that for an adequate treatment of the prognostic problem 
we should strive to predict the individual’s success-profile, 
rather than to predict any unitary combination—however skill- 
fully combined—of those criterion variables. 

This prediction of a profile, as vs. a unitary criterion score, 
may be done by the simple expedients of: 

1. Administering enough tests of a wide enough variety to 
predict well (a relative term!) each of the several subportions 
of a criterion—the wages, the foreman’s ratings of indispensa- 
bility, the accuracy and the quantity of production over com- 
parable times, and so on. 

2. Finding all the possible inter-correlations,

$$\frac{(m+n)(m+n-1)}{2}$$

in number, of the m criterion variables and the n test variables.

3. Employing each of the m sub-criterion variables in turn
as a criterion and determining the appropriate regression
equation for its prediction by the aid of the n several test variables
in each case.
4. Making m different Y predictions for each examinee by
means of the m thus determined regression equations; and 
subsequently plotting for him these m predicted criterion scores 
as a profile on a regular profile chart, the plotted profile thus 
being the predicted profile of the person in question.²
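
The four steps above can be sketched in code. The example counts the distinct pairs among m criterion and n test variables, fits one least-squares regression equation per criterion variable, and assembles the m predictions for one examinee into a profile. All data are hypothetical, and the criteria are constructed as exact linear functions of the tests so that the result is easy to verify; real criterion data would leave residuals. The helper names are illustrative only.

```python
def count_intercorrelations(m, n):
    """Number of distinct pairs among m criterion and n test variables."""
    k = m + n
    return k * (k - 1) // 2


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def fit_regression(tests, criterion):
    """Least-squares weights (intercept first) predicting one criterion from the tests."""
    rows = [[1.0] + list(t) for t in tests]
    width = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(width)] for i in range(width)]
    xty = [sum(r[i] * y for r, y in zip(rows, criterion)) for i in range(width)]
    return solve(xtx, xty)


def predict_profile(equations, test_scores):
    """Apply each regression equation to one examinee's test scores."""
    x = [1.0] + list(test_scores)
    return [sum(w * v for w, v in zip(weights, x)) for weights in equations]


# Hypothetical data: six persons, n = 2 tests, m = 2 criterion variables.
tests = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
criteria = [
    [2 * a + b for a, b in tests],    # sub-criterion 1, e.g. quantity of production
    [a - b + 10 for a, b in tests],   # sub-criterion 2, e.g. accuracy
]

equations = [fit_regression(tests, y) for y in criteria]
profile = predict_profile(equations, (3, 3))
print("pairs to correlate:", count_intercorrelations(m=2, n=2))
print("predicted profile:", [round(value, 2) for value in profile])
```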

Such a profile, in common with all profiles, reveals visually 
in relief in general three things: 

1. The general trend, or tenor, of the scores. 

2. The strong points and the weak points of the individual’s 
several criterion “abilities.” 

3. The dispersion of the criterion “traits” about their cen- 
tral trend. 

To the extent to which the several sub-portions of a cri- 
terion are highly related (possess a common G, or mathemati- 
cal factor) will the standing in the several traits be more or 
less identical and tend, consequently, to result in a series of 
ratings which are located in a horizontal line. The normated 
entries (the encircled “ovals”) of all columns of the plotted 
profile would be identical if all such predicted sub-criteria inter- 
correlated perfectly. 

The mode of using such profiles—in selection, for example— 
is not generally agreed upon. Consequently we are concerned 
more often than not with combining the several sub-criterion 
variables into a unitary criterion score; for in that case the 
simple concept of a “critical score” applies. 

The success, then, of an individual, about which we prate 
so glibly, is a complex thing, and if it is to be made, artificially, 
into a unitary variable must be compounded of the weighted³
sum of the several component parts, as one, simplest, concep- 
tion of the matter. If we accept that definition the problems 
become three in number: 


2 Caution: Each such variable, $\hat{Y}_i$, will have a standard deviation,
$\sigma_{\hat{Y}_i} = R_{0.123 \ldots m}\,\sigma_{Y_i}$, and is therefore more
constricted than is $\sigma_{Y_i}$. Accordingly one will need to find by actual
computation for each case a $\hat{Y}$ value and determine the distribution of
that variable in order to find the decile ranges, the “norms,” for example, for
the profile sheet of column-1, dealing with aspect-1 of the criterion profile.

3 It should be recalled that if we add scores, “using no multiplier,” we really are
using a gross-score weight of 1, which weights the several variables directly as the
size of their standard deviations. Consequently if scores are combined they are
weighted; and it is impossible, then, “not to weight” them.

1. To decide what variables are to be included in the criterion:
i.e., to decide what variables are to be weighted other
than zero. 

2. To decide in what units—comparable measures—to 
record the individual’s standing in the m several sub-variables, 
or component variables, of the criterion. Unless the several 
sub-portions are recorded in comparable scores, the multipliers 
of the scores will weight the variables other than intended. 

3. To determine how these are to be weighted; i.e., to deter- 
mine by what specific weights we shall multiply in turn the 
several sub-criterion scores of John Jones before adding the 
several resultant products to obtain his final “weighted” cri- 
terion score. Let us consider each of the three problems in 
turn. 
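
The three problems can be made concrete with a small sketch. The worker data, the variable names, and the weights below are invented for illustration, and standard (z) scores are only one possible choice of comparable unit. The sketch contrasts a raw sum, which silently weights each variable by its standard deviation, with a weighted sum of standardized scores.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to standard scores (mean 0, SD 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite(sub_scores, weights):
    """Weighted sum of standardized sub-criterion scores, one total per person.

    sub_scores: dict of name -> list of raw scores (same person order in each)
    weights:    dict of name -> multiplier; a weight of 0 excludes the variable
    """
    standardized = {name: z_scores(vals) for name, vals in sub_scores.items()}
    n_people = len(next(iter(sub_scores.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in sub_scores)
        for i in range(n_people)
    ]


# Hypothetical records for four workers (illustrative numbers only).
raw = {
    "weekly_output": [400, 520, 460, 580],    # large SD in raw units
    "quality_rating": [4.0, 2.0, 5.0, 3.0],   # small SD in raw units
}

# "Unweighted" raw sum: output dominates because its SD is far larger.
raw_sum = [o + q for o, q in zip(raw["weekly_output"], raw["quality_rating"])]

# Equal weights applied to comparable (standardized) scores.
equal_z = composite(raw, {"weekly_output": 1.0, "quality_rating": 1.0})

print("raw sum:      ", raw_sum)
print("z composite:  ", [round(t, 3) for t in equal_z])
```

In this invented example the raw sum ranks the workers exactly as output alone would, while the standardized composite lets the quality rating count equally, so the second and third workers change places.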

The Sub-Criterion Success Variables 


Depending on the problem at hand one will have different 
success scores. Thus in a study of divorce, one would consider 
for inclusion in a criterion such variables as length of time that 
the marriage existed before the divorce, the happiness scores of 
the husband and wife at comparable times, or the sum or aver- 
age of the two as an index or measure of the “happiness of the 
marriage,” the number (and the quality) of the children in 
which the marriage resulted before the divorce (if one is inter- 
ested in the eugenical aspects of divorce) and the like. The 
above statement suggests three corollaries: 

1. That the purpose of the study, whether legal, moral, 
social, eugenical and the like, will in part determine, if not the 
individual variables, at least the fields or realms in which we 
may look for the variables to be included. 

2. That there are always variables which are on the border- 
land of “test” and “criteria”; ones which in one study are on 
one side of the equality sign (of the criterion-weighting equa- 
tion) and in another are on the other. Thus in a study of 
divorce in which the length of the marriage (stability or per- 
sistence of the marriage) is the focus of interest, the number of 
children logically is thought of as being a “test,” the presump- 
tion (or hypothesis to be tested) being that where there are 
children a marriage probably is more stable or lasts longer than
where there are none. (There could be a curvilinear relationship
between number-of-children (X) and the persistence in years 
(Y) of the marriage.) In another study where the center of 
interest is the social productiveness of the marriage, such as the 
number of children, the quality of the home, the wealth, happi- 
ness, inventions and personal adjustments secured, the number 
of children is certainly one of a number of alternative “prod- 
ucts” of the marriages studied and quite clearly, then, is one 
of a number of alternative criterion sub-variables. 

3. That there is no universal agreement as to what consti- 
tutes “success” in even one realm, not to speak of it as a gen- 
eralized measure in all realms of life. Concretely, we do not 
agree even as to “why, or for what traits of success we pay a 
given man one wage and another another.” Obviously, then, 
to say that “Mr. Jones is a successful man” means nothing defi- 
nite statistically; even if, by common agreement, it does mean 
that the man in question has been successful in a financial way, 
probably that he has secured promotions, recognition and even 
acclaim; and possibly, though not surely, that he has lived a 
good and useful life—in other words, it suggests much but says 
nothing definite and positive. Since there are no universally 
agreed upon definitions of success nor even concurrence as to 
what variables to include, we must be arbitrary in any case. 
We may seek to mitigate the possible bad effects of our own 
judgment by pooling the judgments of other (competent) 
people as to what variables to include. 

For our purposes here it will be necessary to restrict the 
discussion to some one concrete realm. The one of greatest 
development to date is the vocational. More criterion varia- 
bles have been devised in this realm, probably, than in all 
others combined. Let us restrict our attention then to it. 

In this realm, then, we may ask ourselves what are the 
evidences of vocational success. Let us enumerate some of 
those which have been employed in such studies, while in order 
to save space we discuss each briefly at the point of mention. 

1. Wages. In free competition, where workers are all doing 
the same kind of work, and promotions “on the basis of merit” 
are promptly granted, wages evidently are a fair to good, if not
excellent, aspect of a man’s worth. A little reflection, how-
ever, will reveal that wages often are paid for quite other 
reasons than “merit displayed in free competition.” One man 
is paid more than another of conceptually equal merit because 
he is a relative of the boss; because he has more dependents and 
“needs it”; because he asks for raises more persistently; because 
the boss thinks well of him, or simply “likes him”; because the 
trade union to which he belongs has a minimum wage scale for 
persons of his “experience”; because he has been employed 
longer, etc., etc. Conversely two men of conceptually unequal 
merit may be paid the same wage because the company pays 
only a few different “day rates”; or because promotions come 
only at stated intervals and wages generally are most highly 
out-of-line immediately preceding a “promotion period”; and 
the like. 

Again wages often nowadays are paid according to a “wage 
formula” which “recognizes” many other factors besides basic 
“productivity.” That is to say a wage, or a part of a wage, 
may be a bonus intended as much to motivate the person along 
certain lines of desired behavior as to reward him for “merit 
already displayed.” Thus tardiness or absence from work may 
be penalized more heavily than the hours or minutes lost would 
warrant in the hope of inculcating “punctuality” and “depend- 
ability.” 

In general, equal wages will reveal equal merit only when 
the attendant circumstances are equal, where, for example, the 
number of days or hours is equal; and, in general, where the 
“risks” are equal. Where the wages are equal but the risk 
unequal, equal wage is not an adequate measure of the relative 
worth of the two men receiving it. Concretely, in the first of 
two sales territories it may be harder to sell goods than in 
another (of different “need,” or demand, and socio-economic 
circumstance). Consequently equal commissions received in 
the two mentioned would indicate greater “sales ability” on the 
part of the man in the “more difficult” territory. The problem 
of saying precisely how much better is not necessarily solved 
by an arbitrarily devised correction.

Absences from work, if not allowed for, make the “pay
envelope” an erroneous reflection of the individual’s merit.⁴

Anything which arbitrarily reduces the standard deviation 
of the resultant sub-criterion scores has a deleterious, or attenu- 
ating effect on the resulting scores. Thus, in an effort to pre- 
vent “speed up” and consequent possible reduction of wages 
for doing a given amount of work, workers under bonus sys- 
tems often stereotype the output; that is to say, all or all but 
a few do on any one day practically the same amount of work 
despite their ability in individual cases—possibly in a majority 
of cases—to produce much more. Although the record of a 
day, or any number of days individually may show this, the 
summated record of a week or of a month may not be revelatory 
of this situation. Under such conditions obviously the produc- 
tion is all but meaningless. The correlations with production 
of any truly good prognostic tests in this case will be attenu- 
ated, that is, greatly lowered. Where uniform or practically 
uniform day wages are paid, soldiering on the job on the part 
of the more capable workers may have the same end result on 
any measures of quantity of production which may be collected. 

Where workers work in pairs, aiding each other in simul- 
taneous operations, the speed of the one will of necessity be the 
speed of the other, as in all “assembly line” production and any 
criterion, other than “errors,” becomes largely meaningless. 
The respective wages in this case are a measure of the “wage 
formulas” employed and not of individual differences of merit. 

And one may be paid for “power to achieve” rather than 
for “achievement,” because, for example, one can do hard jobs, 
or rare jobs or specialized jobs which another of the same labor 
force cannot. Even though this month both do the same work 
the first of two men may be paid more because he, alone of the 
two, could do the unusual job if one came up; that is, the em- 
ployer is willing to, and does, pay for insurance that his firm 
“can handle all jobs that may come its way.” 

4 One must still heat the factory even if a certain employee is absent, justifying
considering somewhere—but not necessarily in wages—this criterion component of
“dependability.”

Often also there are notorious biases in wages. In Brazil
it is said to be the custom for a worker to ask for, and confidently
to expect, a raise with the advent of every baby in the
family, and “ability” in this respect by no means is perfectly
correlated with job-worth. Again, women teachers, it is commonly
believed, often receive far less wages for comparable services
rendered than men, the theory, or justification, possibly
being that “men support families” while “women work for pin
money” and should be paid accordingly. And of course the old
law of supply and demand still obtains! In such cases one may
divide the investigation into two independent studies based on
sex.⁵

2. Production. Where all the workers of the experimental 
group are working at “exactly the same work,” the quantity of 
work produced is a good measure of the merit of the individual 
man. Production in general is what occupations are for. Yet 
current production, even on excellently planned “bonus work,” 
may reflect only, or largely, irrelevant factors such as a man’s 
temporary need for money, as for paying off a mortgage soon 
to come due; or reflect too largely his current condition as to 
sickness, fatigue, morale, or interest. 

And, critical inquiry often reveals that the conditions of 
work are not equal, that, for instance: 

One salesman has odd jobs to do which subtract from the 
time in which he can give his full attention to his main job of 
selling. 

One salesgirl, as versus another, is required to sell a higher 
proportion of obsolete stock, which consequently sells more 
slowly than the up-to-the-minute merchandise. Or she may 
have alternative tasks which affect her sales effectiveness, as, 
for example, to help keep stock in shape, to assist the buyer, 
and the like. 

One man has a slightly higher gauge of metal so that his 
“poundage” reflects this fact, rather than his “assiduity.” 

Another man’s machine, “just like this one,” runs at a 
slightly higher rate of speed than his neighbor’s so that his 
poundage, and consequently his pay envelope, reflect this fact. 


5 An alternative is to employ sex (the categories of which are arbitrarily quanti- 
fied) as a test variable. Still another is to make some arbitrary correction for sex, 
thus to add to all females’ wages a constant, the difference found by subtracting from 
the mean male wage the mean female wage. There is no unanimity of opinion as to 
what is best to do in such situations. This field of research procedure needs a careful 
scrutiny. We need criteria for deciding such research dilemmas.

The “units” of production, although assumed equal, may
not in fact be equal. In punching 80-column Hollerith cards, 
for example, one girl may punch daily 1000 cards and a second 
the same number. If the first punches full cards (80 columns) 
while the second punches only 40 columns, the “column” mea- 
sure of production reveals the first worker to be approximately 
twice “as good” as the second. 
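
The arithmetic of this example can be set down in a few lines. The function name is introduced here only for illustration; the card and column counts are those given in the text.

```python
def columns_punched(cards_per_day, columns_per_card):
    """Daily production expressed in columns rather than in cards."""
    return cards_per_day * columns_per_card


first_worker = columns_punched(cards_per_day=1000, columns_per_card=80)
second_worker = columns_punched(cards_per_day=1000, columns_per_card=40)

print(first_worker, second_worker)     # 80000 40000
print(first_worker / second_worker)    # 2.0
```

Counted in cards the two workers are equal; counted in columns the first has done twice the work of the second.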

Even with the units (columns, in the above example) com- 
parable the difficulty (“risk”) of the tasks may be unequal so 
that 80,000 columns, or aggregate holes punched per day, of the 
one may not be equivalent to 80,000 columns of another. The 
one may be punching data requiring coding, which slows the 
task, or may be punching from a rather illegible data medium, 
while the other may have straight numerical punching from 
highly legible and exceedingly conveniently arranged data 
media. And even if both are numerical the one may have a 
data-medium where the entries are hard to locate (requiring 
many backward and forward movements of the eyes) while the 
second has a data-medium in which all the answers align in a 
column at the right so that punching them is all but an auto- 
matic job. 

Time tends to relegate many factors such as those men- 
tioned to the status of chance, or compensating errors. Accord- 
ingly, the longer the time over which criteria are collected the 
less important, generally but not necessarily, are such “errors.” 

In some cases, also, one may so arrange it that there are ade- 
quate controls of such matters, that, for example, half of the 
subjects work for one week on the one project while the other 
half work on the other, and the following week the groups inter- 
change places, while some multiple of two weeks’ wages is taken 
as the criterion variable. 

It would appear, then, that production is likely to be a 
better criterion, if the attendant circumstances of its accumu- 
lation are under the control and supervision of the research 
worker; if, in a word, he collects the data himself, rather than 
taking uncritically what may be handed to him by a production 
officer or records clerk. Only a very few firms now collect 
routinely multiple, and reasonably comparable, aspects of the 
success of their workers on various variables.

3. Quality of Work. The quality of work done by a
“worker” is everywhere highly regarded as an important ab- 
stract virtue. In some cases, such as the performance of the 
stage, screen, opera, studio, prize-fight ring, and operating 
table, it is the paramount consideration. Yet, like the former 
variables, it too has its difficulties. 

In bonus schemes bad quality of work frequently is dis- 
counted, and oftentimes more heavily perhaps than the “psy- 
chological” merits of the case would warrant. Such penalties 
often are “set high” as a means of discouraging their occurrence. 
Sometimes they are based on a quasi-rational principle as “the 
estimated time, often generous, which it would take to undo the 
incorrect or erroneous work and do it over again correctly or 
‘errorlessly.’ ” 

Quality in most cases is based essentially on subjective 
human judgment. Limit gages sometimes give a highly objec- 
tive status to quality measurements where the precise size of a 
product is an important desideratum in the finished product 
repetitiously produced. Even here the man whose average 
error is .0002 usually receives no more wage, honor, merit, con- 
sideration, or even notice than he whose average error is .0005 
when .001 is the allowable maximum deviation of the product 
from the ideal or perfect measurement. One remedy for this is 
to devise measuring machines which reveal instantaneously, 
when a finished product is brought between the jaws of a mea- 
suring device and a constant pressure automatically is applied 
by means of a friction wheel, the amount of error on each piece, 
and its sign. This yields a distribution of the individual’s 
actual errors the mean of which may be taken as his “inaccuracy.”
Such a distribution customarily is not routinely collected
by industry. It consequently can be had only by special
inquiry—involving the specially designed machinery—intro- 
duced for that purpose. 
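
A brief sketch of such a record follows. The individual error readings are invented, chosen so that the two workers average .0002 and .0005 as in the example above, and the function name is hypothetical. Because signed errors may cancel one another, the sketch reports the mean absolute error alongside the mean signed error; which of the two should serve as the "inaccuracy" is a choice the text leaves open.

```python
from statistics import mean

TOLERANCE = 0.001  # allowable maximum deviation, as in the text


def accuracy_summary(signed_errors, tolerance=TOLERANCE):
    """Summarize one worker's distribution of signed measuring-machine errors."""
    return {
        "all_within_limit": all(abs(e) <= tolerance for e in signed_errors),
        "mean_signed_error": mean(signed_errors),              # systematic bias
        "mean_absolute_error": mean([abs(e) for e in signed_errors]),
    }


# Hypothetical readings for two workers (illustrative only).
worker_a = [0.0002, -0.0003, 0.0001, -0.0002]
worker_b = [0.0006, -0.0004, 0.0005, -0.0005]

summary_a = accuracy_summary(worker_a)
summary_b = accuracy_summary(worker_b)
print(summary_a)
print(summary_b)
```

A limit gage alone would pass every piece from both workers; only the recorded distribution shows that the first is considerably the more accurate.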

An objective “rating scale” may be constructed to measure 
quality. Thus by scaling cookies as to taste, and comparing the 
taste of each experimentally produced cookie with the taste of 
scaled samples one may ascribe numerical magnitudes to the 
taste of samples of cookies, and so assign the merit of the several
cooks thereof, even though the scale be consumed in the
process! Sewing, soldering, cookie-cutter, wire-splicing and
lettering scales have been used and, when properly used,⁶ yield
objectivity and probably also corresponding validity, in their 
respective realms. 

One may sample the products periodically, noting, for
example, the errors made in successive hourly intervals, as
collected by the inspector on rounds made at frequent but
unanticipated times
throughout the work day. This has been called the timed
sampling technique. Errors in this may arise if the worker 
strives, consciously or otherwise, to do a better quality of work 
for a limited time when he suspects an inspection is “about 
due.” 

The best measures of quality result from those situations, 
in repetitive production, where all errors above an allowable 
minimum are automatically detected by objective inspections 
(e.g., by the electric eye). In this case the poundage of dis- 
carded work is a significant measure of the quality of work 
done. 

Breakage and spoilage of raw materials or of partly finished 
products to which the worker is fitting additional parts are 
important aspects of a man’s worth and so often are elements 
of a criterion. In assembly or bench work often these consti- 
tute the core of the criterion. To take account of different 
speeds of assembly of different workers one may take as his 
index of breakage or spoilage the percentage (or proportion) 
which the wasted material is of the total material processed. 
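
The index just described is a simple proportion. In the sketch below the function name and the counts for the two workers are invented for illustration.

```python
def spoilage_rate(units_spoiled, units_processed):
    """Proportion of processed material that was wasted."""
    if units_processed <= 0:
        raise ValueError("units_processed must be positive")
    return units_spoiled / units_processed


# Hypothetical weekly records: a fast assembler and a slow one.
fast_rate = spoilage_rate(units_spoiled=12, units_processed=600)
slow_rate = spoilage_rate(units_spoiled=9, units_processed=300)

print(f"fast worker: {fast_rate:.1%}, slow worker: {slow_rate:.1%}")
```

On a raw count the fast worker spoils more pieces, yet as a proportion of material handled he wastes less than the slow one.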

Clerks who otherwise are ideal employees may fail miserably 
in their employer’s eyes if they are highly susceptible to making 
many clerical errors, illegibilities of writing, or errors in compu- 
tation, particularly the undetected ones which cost the em- 
ployer money or require him to adjust matters with irate 
customers. 

6 It is often overlooked that if it takes m judges to determine an objective scale
it may also take almost as many judges reliably to ascribe scores to individual
products by its aid.

4. The Rate (or, alternatively, the amount) of Acquisition
of New Skills. In aptitude tests, generally, the length of time
taken by persons, equally unskilled at the beginning, to achieve
equal competence is a rather valid criterion measure. This presupposes
“equal opportunity to learn.” It stresses the overhead
aspects of employee worth. In many jobs the training is
costly. In war-time, for example, delays in production—due to
training time, or any other cause—are scanned more closely
than usual. Other things equal, that employee who costs least
to obtain, to train and to maintain, is the most valuable. And
if promotion is contemplated, the fast learners are those most
worth retaining. There is at least a fair presumption that they
will be rapid learners on the subsequent job employing a good
core of the same kinds of skills.

This criterion may be applied either to apprentices in a for- 
mal training school or to workers in training on the job itself. 
In fact, as will be shown below, it is sometimes possible to so 
arrange it that the job itself becomes a prognostic test with 
generally good all-round results. 

5. Supervisor's Judgments. These may be of “over-all pro- 
ficiency” and may be employed in lieu of a more satisfactory 
objective criterion; or they may be ratings, say, of the several 
specialized ends or objectives of the work done, of which it is 
generally claimed, or claimed for the present, in lieu of objec- 
tive scales thereof, that only the foreman or other supervisory 
officer is an adequate judge. 

These often are notoriously unreliable in the technical, as 
vs. the popular, sense. The judgment of a foreman in an eyelet 
department of a certain brass factory agreed with those of his 
assistant as to the over-all merit of his men to the extent of only 
.60, even though aided by a formal card-selection-into-piles 
method of rating. He agreed only slightly better with himself 
on a subsequent occasion; that is to say the self-correlation was 
but little better. The antidote for such, of course, is repeated 
ratings, monthly or bi-weekly or even weekly ratings over a 
considerable period of time, with all the resultant ratings aggre- 
gated into a composite, summated or averaged, score. By this 
expedient the composite rating may be made to approach as 
near 1.00 in reliability as we wish while its validity may be 
expected to obey the Brown-Spearman prophecy formula; that
is to say, may be expected also to increase somewhat but to
approach rapidly, with increase in the number of ratings, a fixed 
ceiling which generally is far short of 1.00. The crucial ques- 
tions here are: (a) just what “traits” shall be so rated, and 
under what circumstances? and (b) how many ratings should 
be secured, and from whom? 
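
The behavior described above can be sketched numerically. The formulas are the standard Spearman-Brown expression for the reliability of a composite of n parallel ratings and the companion expression for its correlation with an outside criterion, under the usual assumption that the ratings are equivalent. The single-rating reliability of .60 is borrowed from the brass-factory example; the single-rating validity of .40 and the function names are assumptions made only for illustration.

```python
from math import sqrt


def composite_reliability(r_single, n):
    """Spearman-Brown: reliability of the sum of n parallel ratings."""
    return n * r_single / (1 + (n - 1) * r_single)


def composite_validity(v_single, r_single, n):
    """Correlation of the n-rating composite with an outside criterion."""
    return v_single * sqrt(n / (1 + (n - 1) * r_single))


def validity_ceiling(v_single, r_single):
    """Limit approached by composite validity as n grows without bound."""
    return v_single / sqrt(r_single)


R_SINGLE = 0.60  # reliability of one rating (figure quoted in the text)
V_SINGLE = 0.40  # validity of one rating (hypothetical)

for n in (1, 2, 4, 12, 52):
    print(
        n,
        round(composite_reliability(R_SINGLE, n), 3),
        round(composite_validity(V_SINGLE, R_SINGLE, n), 3),
    )
print("ceiling:", round(validity_ceiling(V_SINGLE, R_SINGLE), 3))
```

Under these assumptions the reliability climbs toward 1.00 as ratings accumulate, while the validity levels off near its ceiling, well short of 1.00.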

The ratings of the products of artists’ work, pictures, etchings,
and the like, are notoriously unreliable, intercorrelating
no higher, perhaps, than .30. Even the reliability of the much- 
vaunted medical skill and judgment in diagnosis is probably 
no higher than .50. In the latter case Brown’s formula tells us 
that to secure a composite judgment with a reliability of .95— 
such as any group intelligence test, in its field, will yield on a 
2-hour examination or thereabouts—not to mention validity, 
would require the independent judgments of an assemblage of 
some nineteen equally competent doctors. It follows that (a) 
often one cannot assemble enough “judges” who know well 
enough the subjects of the investigation to rate them accu- 
rately; and (b) that one often cannot, in many fields, obtain 
judgments or ratings “worth their salt.” 
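
The figure of nineteen doctors follows from solving the Brown formula for the number of judges. The sketch below does so for the two single-judge reliabilities mentioned above (.50 and .30) and a target of .95; the function names are introduced for illustration, and the judges are assumed to be equally competent and independent, as the text stipulates.

```python
from math import ceil


def composite_reliability(r_single, n):
    """Spearman-Brown: reliability of the pooled judgment of n judges."""
    return n * r_single / (1 + (n - 1) * r_single)


def judges_needed(r_single, r_target):
    """Smallest whole number of judges whose pooled judgment reaches r_target."""
    exact_n = r_target * (1 - r_single) / (r_single * (1 - r_target))
    # Subtract a tiny tolerance so floating-point noise cannot add a judge.
    return ceil(exact_n - 1e-9)


doctors = judges_needed(r_single=0.50, r_target=0.95)
art_judges = judges_needed(r_single=0.30, r_target=0.95)

print(doctors)      # 19, the figure given in the text
print(art_judges)   # 45
print(round(composite_reliability(0.50, doctors), 3))
```

By the same reckoning, ratings of artists' products intercorrelating .30 would call for some forty-five judges to reach the same reliability.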

The reports of a training supervisor may be of more worth 
than of a regular line officer. Thus the worth of an assistant 
foreman, for purposes of rating job-worth of employees, may be 
greater than that of his superior, the foreman-in-charge, par- 
ticularly if the former supervises the men in their day-to-day 
work in the shop, while the latter concerns himself largely, or 
at least more than the former, with the general planning of the 
work and the paper-work of the department. In fact the latter, 
under this circumstance, may be all too much influenced in his 
ratings by his knowledge of the wages which they receive and 
all too little by their more pertinent behaviors under varying 
shop conditions. The size of the pay envelopes or the exalted- 
ness of the titles of the supervisors does not necessarily correlate 
highly, or even positively, with their “ability to judge.” The 
correlations could be negative! 

If the foreman believes that “experience makes the man,” 
his ratings or rankings will be substantially a ranking of the 
men by age, or by experience. And age, or experience, in this
case will correlate more highly with his ratings than any other
predictive variable. Since age, or experience, in most occupa- 
tions does not correlate highly with “competence” it follows 
that this attitude leads to spurious or attenuated ratings the 
badness of which will not be readily apparent. 

It is notoriously true that supervisors of teachers are not 
able to rate reliably—not to mention validly—the teachers 
under their direction on even a fairly objective trait,⁷ such as
the teacher’s “ability to gain and hold the attention of the 
class.” Consequently we must believe that any generalized 
objective such as the “teachers’ teaching effectiveness” cannot 
be reliably rated, even with the aid of formal rating scales, with- 
out many repeated visits for observation of the actual teaching. 
The ratings actually received often are a composite of the 
rumors about a teacher seasoned only with a dash of real knowl- 
edge thereof. 

The situation in school or industry in which more than 
two—or even two—persons know intimately all those under 
their direction is rare indeed. Two ratings by different persons 
are probably more valid than the same number, two, of re- 
peated ratings by one judge; but if the former cannot be ob- 
tained the latter is about the only alternative. 

One thought here, however, where the number of judges 
of necessity is inadequate, is to have fellow-workers, fellow- 
apprentices, fellow-teachers, or fellow-pupils, say, rate one an- 
other. The extra number of raters may either partly or fully 
compensate for any assumed “lack of experience in (or compe- 
tence of) judgment.” Such “judges” indeed are able, for exam- 
ple, to note certain aspects of competence not so readily de- 
tected by the supervisory officer. Workmen often know, for 
example, when the shoddy work of a fellow-worker is being 
“buried” by an overdose of a beneficently-hiding coat of paint, 
like the doctor’s mistakes by the casket! 

One might even weight more heavily the judgments rendered
by the adjudged better workers than those of the poorer;⁸
but if “everybody judges everybody”—under conditions where
nearly everybody knows everybody well—it is a fair guess that
the additional returns from such weighting generally may not
pay for the additional time and trouble necessary.

6. Knowledge. Achievement tests and trade tests of knowledge
possessed may be employed as a criterion sub-variable.
There is more truth than fiction in the standing joke that the
eminent surgeon’s bill of $200 for removing your appendix
really is $25 for “removing your appendix” and $175 for his
“knowledge of what to do and what not to do.” Popular
opinion has it that there is an almost zero correlation between
knowledge and skill. Actually in most trades the two correlate
quite highly; so highly in fact that oral trade tests (measuring
knowledge primarily) are excellent tests for picking men in
order of their ability “to perform” on the job.

7. Job Tenure. This is comparable if the employees, say,
all entered the firm at the same time and so have a good chance
of being subjected to the same (historical) factors and conditions.
This variable is only slightly related to “competence.”
It probably is highly related to that aspect of job success known
as “job satisfaction.” It would not be comparable if business
cycles, wars, or other “disturbances” did not “cover” all
employees’ records equally.

Following the general reasoning that the profile which is
most unique (in the factor sense) is, other things equal, the
most stable (statistically reliable)⁹ and most representative of
the person, it would seem that one should aim, in predicting
profiles, to include as many unique variables as possible; while
if one is combining them into specific sub-criterion variables,
such as “accuracy,” then the variables included should be as
“loaded” with the sub-criterion “factor” (here “accuracy”) as
possible. In the second case, the accuracy “tests” should be
highly intercorrelated rather than the converse.

7 Bowman, Earle. A Plan for Evaluating Teaching in Terms of Pupil Activities.
Unpublished Dissertation. Ohio State University Library, 1928. Pp. 185.

8 This presupposes a positive and “high” correlation between ability to do and
ability to judge. Several studies have in fact shown that this is the case; that,
indeed, this is a point which education thus far has but little recognized or exploited,
namely that at a certain stage in the progress of learning one may possibly secure
more progress by having the learners rate the products of their fellow trainees rather
than do another construction project. See Smith, R. E. and Toops, Herbert A. “An
[illegible] in Self-Rating Shop Products.” Industrial Arts Magazine, XV (1926),

Rating, it may be assumed, develops the critical faculties of the workman;
and after a modicum of skill at performance has been obtained the ability to detect
the faults of one’s production—as a result of the training in observing the faults of
the work of others—obviously may be the royal path to their early and subsequent
elimination in one’s own work.

9 Edgerton, H. A., Bordin, E., and Molish, H. “Some Statistical Aspects of
Profile Records.” Journal of Educational Psychology, XXXIII (1941), 185-196.

8. Supervisory and Leadership Ability. The normal prog- 
ress of a man in industry—and particularly so in war times— 
is to be given supervisory responsibility as fast as he demon- 
strates an ability to assume it. Where this is the case the 
supervisory status of a man is an important criterion element.
The foreman is distinguishable from the workman often, or 
even usually, not so much by “superior ability to do” as by 
superior knowledge, better judgment, greater ability to solve 
difficult practical and theoretical problems, ability to lay out
work, to size-up repair jobs, to criticize faulty work, and to 
induce men cheerfully and uncomplainingly to work indus- 
triously. 

The number of workers supervised is probably of far less 
moment than the adequacy, or quality, of the leadership ren- 
dered. This conceivably might be measured by some form of 
check list. 

9. Incidental Factors. 

(a) Amount of supervision required. 

(b) Attitude toward supervision. 

(c) Job satisfaction. 

(d) Adjustment to the work and to fellow-workers. 

(e) “Influence” on fellow-workers (e.g., morale-building 
influence). 

(f) Skills inventory, for example a measure of the “cor- 
rectness of motions,” the extent of possession of such a set of 
motions that present efficiency is likely not only to persist but 
also to increase (improve with practice). Does this auto driver 
have the right “driving habits” as measured by a check list? 
Does this salesgirl have the right sales tactics, as ascertained by 
a professional shopper? Does this professional swimmer have 
the “logically correct swimming movements”? 

Viteles points out that success is not speed plus accuracy 
plus job tenure; but that it is an integrated whole. The difficulty
of explaining an n-dimensional whole in quantitative
terms requires as the only alternative to combining the several
scores, as by weighting, for example, that one treat the success
as a profile¹⁰ as above indicated.

Bingham and Freyd have pointed out that objective criteria 
are more likely to correlate with objective tests, while ratings 
are more likely to correlate highly with tests of personality. 
If so, ratings, often hitherto employed as criteria, no doubt 
have perpetuated some tests which ought to have died a natu- 
ral death long ago. 

A criterion that is objective and uninfluenced by human 
judgment is usually “a more predictable measure” of job pro- 
ficiency than a rating based entirely upon supervisors’ opinions 
of the workers’ performance. 

A few general rules applying more or less to all such cri- 
terion elements may not be out of place: 

It is important to make provision for the orderly and syste- 
matic collection of the criterion first, that is, ahead of adminis- 
tering the tests; otherwise the subjects may “get away” from 
one (by the route of graduation or drop-outs of students, the 
transfer of workers, the shipment overseas of soldiers, and the 
like), whereupon one will find that he has a fine bunch of expensively 
collected and scored test papers, but no criterion to 
tell him whether they are worth anything or not. 

The test scores cannot of themselves tell one whether they 
are of any worth. One of course must have a wide dispersion 
of test scores, but that of course is a fairly common character- 
istic of worthless tests. The test must be fairly difficult, ideally 
of such a difficulty that the average score is about half the pos- 
sible maximum score, but that also is a characteristic of many 
worthless tests. It must have some reliability, but, in the ex- 
perimental stages and other things equal, the reliability had 
better be low rather than high to the end that its validity as 
well as its reliability may be greatly improved by lengthening 
it. Among existing tests generally (and possibly even among 
tests of a particular type as well) there is no correlation of note 
between reliability and validity. Accordingly, high reliability 
in itself warrants nothing as to the value of a test. And finally, 
even if a test has a large dispersion, the right difficulty, and a 
good to high reliability, all three, even that pattern of suppositions 
warrants nothing as to the value of a test. Validity may 
be established only by a validity coefficient, in turn determinable 
only by the aid of a criterion. 

10 Even then, profiles which may be adequately denoted by code numbers (from 
addends) can scarcely be apprehended in entirety. 

Much time and care should be expended on the collection 
of adequate variables. The final test scale or prognostic battery 
probably will be no more valid than the yardstick by which it 
was constructed. And at best it can be no more valid than the 
square root of its reliability coefficient. This conclusion is 
achieved by manipulation of the formula for the attenuation 
coefficient. 
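
The bound just stated can be sketched from the correction-for-attenuation formula. The notation is an assumption, since the text gives none: $r_{xy}$ is the observed validity coefficient, $r_{xx}$ and $r_{yy}$ are the reliabilities of test and criterion, and $r_{\infty}$ is the correlation between the error-free test and the error-free criterion.

```latex
\[
r_{\infty} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\quad\Longrightarrow\quad
r_{xy} = r_{\infty}\sqrt{r_{xx}\, r_{yy}}
\]
\[
|r_{\infty}| \le 1 \ \text{and}\ r_{yy} \le 1
\quad\Longrightarrow\quad
|r_{xy}| \le \sqrt{r_{xx}\, r_{yy}} \le \sqrt{r_{xx}}
\]
```

On this reading a test with a reliability of .64, for example, could show a validity of at most .80, and less if the criterion itself is unreliable.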

Possibly as much time should be spent in devising the cri- 
terion as in constructing and perfecting the tests. This impor- 
tant part of a research seldom receives half the time or attention 
it requires or deserves. If the criterion is slighted the time 
spent on the tests is, by so much, largely wasted. 

All criterion variables should be collected over comparable 
times. The measurement of improvement, for example, in gen- 
eral should start at comparable periods in the several persons’ 
individual learning or growth curves. 

Criterion variables which accrue in time should represent 
the same amount of time sampling. It is better, for example, 
to include in a study all freshmen who persist to the end of the 
year, taking as a criterion score for each his end-of-the-fresh- 
man year scholarship (to one definite point in time), than to 
include also therein the averaged final success of those who 
dropped out with only one quarter (one-third of a year) of 
college work and also the final success of those additional ones 
who dropped out with only two quarters (two-thirds of a year) 
of college work and so on. The three populations mentioned 
have over-all marks which have vastly different reliability 
coefficients, say .70, .824, and .875 (Brown’s reliability formula) 
respectively, and considerably different validity coefficients. 
The restriction is partly in the interest of rendering comparable the reports, 
for example, the validity coefficients, regression equations, 
and the like. If one employs the maximally heterogeneous 
group, the validity found by another who uses the same 
tests, otherwise comparable, would be exactly comparable only 
if he had the same proportions of the three sub-populations 
respectively. 
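
The three reliability figures quoted above follow from the Spearman-Brown formula if one assumes, as the text appears to, that a single quarter of marks has a reliability of .70. A minimal sketch (the function name is illustrative):

```python
def spearman_brown(r_unit: float, k: float) -> float:
    """Reliability of a measure lengthened k times, given the reliability
    r_unit of one unit length (here, one quarter of college marks)."""
    return k * r_unit / (1.0 + (k - 1.0) * r_unit)


# Assumed reading of the text: one quarter of marks has reliability .70.
ONE_QUARTER = 0.70

for quarters in (1, 2, 3):
    print(quarters, round(spearman_brown(ONE_QUARTER, quarters), 3))
```

Marks averaged over one, two, and three quarters therefore rest on unequally reliable criteria, which is one reason the three sub-populations should not be pooled.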

If data are missing in only a small portion of the cases and 
in only a few of the criterion variables these may be and per- 
haps ordinarily should be supplied before beginning to combine 
the data. Toops¹¹ has developed a formula for this. The formula 
assumes that for any missing score in a given variable one 
should supply the weighted average standard score of the per- 
son in question on those variables in which the data are avail- 
able. If the variables are highly intercorrelated, and few scores 
are missing, the assumption is justified. If a small per cent, 
preferably not over one per cent, of the persons cannot be rated, 
or the ratings (rating slips, for example) are lost, and there is 
reason to believe that these are a random selection of the total, 
then the data-present cases may all be reduced to centile ranks 
on the basis of the available N’s, while the missing-data cases 
may be assigned the rank 50.¹² To facilitate such “transmutation” 
on a large scale one may employ Hull’s Tables.¹³ 
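
The first procedure described above (supplying a person's weighted average standard score on the variables that are present) might be sketched as follows. This is an illustration of the stated assumption only, not Toops's published formula, which the text does not reproduce; the function name, the data, and the equal weights are hypothetical, and standard deviations are taken over the available cases with N in the denominator.

```python
from statistics import mean, pstdev


def fill_missing(rows, weights):
    """Return standard scores for every person, filling gaps (None) with the
    weighted average of that person's available standard scores."""
    m = len(weights)
    # Mean and SD of each variable, computed from the persons who have a score.
    params = []
    for j in range(m):
        present = [row[j] for row in rows if row[j] is not None]
        params.append((mean(present), pstdev(present)))
    filled = []
    for row in rows:
        z = [None if x is None else (x - params[j][0]) / params[j][1]
             for j, x in enumerate(row)]
        have = [(weights[j], z[j]) for j in range(m) if z[j] is not None]
        z_hat = sum(w * v for w, v in have) / sum(w for w, _ in have)
        filled.append([z_hat if v is None else v for v in z])
    return filled


# Hypothetical data: four persons, three criterion variables, one missing score.
rows = [[10, 20, 30], [30, 60, None], [20, 40, 60], [20, 40, 90]]
filled = fill_missing(rows, weights=[1, 1, 1])
print(filled[1])
```

The sketch assumes every person has at least one score and every variable has some spread; otherwise the divisions fail, which is a fair signal that the case or variable should be dropped rather than patched.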

One may in some cases devise a mechanical device to count 
the success of workers automatically, possibly even unbeknownst 
to them. Thus a Veeder Counter may be so attached to a typewriter 
or a Hollerith punch that it counts every key-stroke 
produced. If the one machine is employed daily and continu- 
ously by one operator, and one operator only, the total strokes 
of any one day, week, or month, as the case may be, is equal to 
the close-of-business reading less the reading-at-the-beginning- 
of-the-time-interval. Over long periods of time even errors, 
such as the use of the machine occasionally by an executive 
after hours to type a brief letter and the like, tend to iron out. 
If such machines are taken over by substitutes when the employee 
in question is ill or absent this may introduce appreciable 
errors into the resulting scores; but less than would be produced 
by absence and non-use of the machine by anyone if the days 
not worked are not allowed for. 

11 Toops, Herbert A. “The Selection of Graduate Assistants.” Personnel 
Journal, VI (1928), 470-471. 

12 ... half by chance should be assigned the rank 50 and half the 
rank 51. 

13 Hull, Clark. Aptitude Testing, Appendix 1. Yonkers: World Book Co., 
1928. Pp. 491-492. 

Most systems of allowances for errors are not particularly 
psychological. By means of addends the patterns of errors (or 
of correct responses) may be recorded¹⁴ and later may be subjected 
to alternative modes of analysis. To all intents and 
purposes the addend code number is the original performance. 

Rate-setters and time-setters have many ridiculous customs 
and “allowances” as well as wise and wiser ones. Their “standard 
time” may be the shortest of rough guesses rather than the 
result of a painstaking analysis. They often are overly influ- 
enced by a “star (99 percentile) performance” rather than a 
statistically determined average performance. 

In criterion-building we often imply the equality of numer- 
ous “things” which in fact never are equal: 

1. That the motivation of the subjects is equal. This is a 
particularly important consideration. 

2. That the “risk” is equal. We here are confronted with 
such questions as: Do people ride cabs equally frequently in 
this cab driver’s “district” as in others? The distance traveled 
or the fares collected may not be an adequate measure if that 
is not the case. If the houses are farther apart (as in the newly- 
built-up sections of town) the potential riders per mile are 
fewer but the actual riders per mile may be more. 

Is work available at all times so that this machine tender’s 
wages are free from error by reason of his pay envelope not 
reflecting, or reflecting inadequately, “lost time”? 

3. That the experience, or practice, or education, is equal. 

4. That the human factor, as vs. machine factors, is the one 
which produces the basic variation in criterion scores of the 
various individuals of the criterion group. 

5. That the persons employed for a criterion may allowably 
be combined in a society, rather than properly only be divided 
into two or more. We have mentioned sex above. The same 
can be said for age; and in the case of machine-tenders possibly 
the age of their machines. 

6. That the work environment is equal. Ventilation, noise, 
lighting and other conditions all have differential effects on 
individual workers which in some instances amount to biased 
errors, or worse. Racial or religious antagonisms sometimes 
divide an industrial department into “good” and “poor” producers. 

7. That the progress of work is not impeded by hindering 
factors, such as unavailability of raw materials, transportation, 
power or supplies over which the employee has no control. 

14 Toops, Herbert A. “Code Numbers as a Means of Scoring Group-Administered 
Test Products.” Journal of Applied Psychology, XXVI (1942), 13 ; 


Comparable Scores 


Before the several portions of a criterion to be combined are 
weighted, each person’s X-score in each variable must be re- 
duced to comparable scores. For this purpose comparable 
scores may be defined as any scores that have equal variability. 
Since “success” at best is an arbitrary variable, one need not 
involve in the definition the requirement of equal means in the 
several variables. If this is the definition adopted, then rankings 
(based in common on all N persons in each different criterion 
variable, with “tied scores” outlawed) are comparable 
scores; and so are gross standard scores or Z’s ($Z_1 = X_1/\sigma_1$), and 
also A.D. units, median deviation units, and the like. 
Because of their highly desirable algebraic properties (the 
ability to predict exactly standard errors of estimate, for example, 
if they are used), practice restricts comparable scores to 
standard scores or some variations thereof, such as Z’s, or 
T-scores, and the like, all of which in common have equal 
standard deviations. Standard scores, $z_1 = x_1/\sigma_1$, where $x_1$ is 
the deviation of $X_1$ from its mean, are most frequently 
employed and have a property, sometimes valuable, 
that all the criterion variables, automatically by their use, are 
made to have equal means as well as equal standard deviations. 
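
The distinction between gross standard scores and standard scores can be seen in a short sketch. The function names and the data are hypothetical, and the standard deviation is computed with N in the denominator, which is an assumption about the convention intended.

```python
from statistics import mean, pstdev


def gross_standard_scores(xs):
    """Z = X / sigma: equal standard deviations, means left unequal."""
    sd = pstdev(xs)
    return [x / sd for x in xs]


def standard_scores(xs):
    """z = (X - mean) / sigma: equal standard deviations and equal (zero) means."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]


speed = [40, 50, 60, 70]   # hypothetical units produced per hour
errors = [2, 4, 6, 8]      # hypothetical errors per day

for name, xs in (("speed", speed), ("errors", errors)):
    big_z = gross_standard_scores(xs)
    small_z = standard_scores(xs)
    print(name, round(mean(big_z), 3), round(pstdev(big_z), 3),
          round(mean(small_z), 3), round(pstdev(small_z), 3))
```

Either transformation gives the two variables the same spread; only the second also gives them the same mean.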


The Arbitrary Weighting Formula 


The arbitrary weighting formula for combining the m several 
sub-portions of a criterion into a criterion variable, or the 
several sub-criterion variables into a unitary criterion score, 
accordingly is: 

$$y = \beta_1 \frac{X_1}{\sigma_1} + \beta_2 \frac{X_2}{\sigma_2} + \cdots + \beta_m \frac{X_m}{\sigma_m} \qquad (1)$$

where $X_1, X_2, \ldots, X_m$ represent the m several sub-portions 
to be summated, and $\beta_1, \beta_2, \ldots, \beta_m$ are arbitrary weights, not 
precluding “logical” weights, to be accorded the m several portions, 
or variables, respectively. 
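
Formula (1) amounts to dividing each sub-portion by its own standard deviation, multiplying by its weight, and summing. A minimal sketch, with a hypothetical function name, hypothetical data and weights, and the standard deviation taken with N in the denominator:

```python
from statistics import pstdev


def composite_scores(variables, weights):
    """variables: m lists (one per sub-portion), each holding one score per person.
    Returns one weighted composite per person, as in formula (1)."""
    sds = [pstdev(v) for v in variables]
    n_persons = len(variables[0])
    return [
        sum(b * v[i] / sd for b, v, sd in zip(weights, variables, sds))
        for i in range(n_persons)
    ]


speed = [40, 50, 60, 70]      # hypothetical
accuracy = [90, 80, 95, 85]   # hypothetical
print(composite_scores([speed, accuracy], weights=[60, 40]))
```

Because each variable is first put on a common spread, the weights alone govern how much each sub-portion contributes to differences among persons.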

If one is going to predict a profile instead of a unitary cri- 
terion score, one will naturally desire to secure as adequate an 
“accuracy score” or “speed score,” for example—one of the 
basic variables of the profile—as possible. It follows that 
possibly one will procure several different accuracy scores for 
each individual, e.g., his September accuracy score, his October 
accuracy score, and his December accuracy score. In this 
event, then, we shall need to combine two or more scores (in 
the above case three) for each “variable” (such as speed, accu- 
racy, bonus earnings, etc.) available. Consequently in this 
case we still will have the problem of combining scores, and the 
above formula with corresponding adaptation still is appro- 
priate. 

Thus we cannot escape the “weighting” dilemma. Even if 
we have only one criterion score, we still would weight the 
variable with $\beta_1 = 1$, in the bids system below. 


Reversing Signs of Sub-Variables 


If one is combining “accuracy” scores and “error” scores one 
may reverse the signs of the “error” scores so that they will 
combine properly with the former. This can be done either 
by giving a negative sign to the $\beta$’s or to the error scores, in the 
bids system below. One of the best systems is to transmute 
all the original scores, $X_1$’s, to transmuted scores on such a variable 
by the formula, 

$$X_1' = K_1 - \beta_1 X_1, \qquad (2)$$

where $\beta_1$ is the bids-importance of the trait, a positive magnitude, 
and $K_1$ is taken of such a magnitude that all resulting 
$X_1'$-scores are positive. It follows that the resulting $X_1'$’s will 
all be positive but that the less meritorious (e.g., most error-producing 
persons) will have the smallest-sized scores and vice 
versa. 
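
Formula (2) can be tried on a handful of error scores. The rule used below for choosing the constant K (largest weighted score plus a small margin) is an assumption; the text asks only that K be large enough to keep every transmuted score positive.

```python
def transmute(error_scores, beta, margin=1.0):
    """Apply X' = K - beta * X, with K chosen so that every X' is positive."""
    k = beta * max(error_scores) + margin   # assumed rule for picking K
    return [k - beta * x for x in error_scores]


errors = [2, 4, 6, 8]   # hypothetical error counts
print(transmute(errors, beta=1.0))
```

The person with the most errors receives the smallest transmuted score, so the result can be added to accuracy scores without any further change of sign.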


The Bids System of Arbitrary Weights for the 
Portions of a Criterion 

The several arbitrary weights $\beta_1, \beta_2, \beta_3, \ldots, \beta_m$ of formula 
(1) most conveniently can be obtained by the bids system proposed 
by T. L. Kelley for weighting factors-of-merit in rating 
the S.A.T.C. in World War I. In essence it involves the following 
elements: 

a. N judges, at least above a minimum of competence, 
ascribe to each of m criterion variables bids-of-relative-impor- 
tance, in respect to their judged importance, under the con- 
trolling principle that the sum of the bids allotted shall exactly 
equal a certain predetermined quantity, say 100. The situation 
is thus rendered as close as may be to that of, “all other 
things equal, what per cent of the total importance (variance) 
can be ascribed to a deviation of $1\sigma$ in this variable of concern.” 
It will be noted that this is one of the central points in our 
notions of variance. Consequently it is to be presumed that 
statisticians who have dealt with this specific problem, after an 
adequate on-the-job study of some weeks or months, may be 
more capable of doing what is statistically demanded than the 
run-of-the-mine foreman or supervisor who is about the only 
available other source of “expert judges.” 

b. After the judges have rendered independently their ver- 
dicts of bids on the several traits, preferably in integral num- 
bers adding to 100, one may weight all the judgments of a given 
judge, if desired, according to his assumed competence-at-judg- 
ing. Thus one could weight the judgments as the square root, 
or cube root, say, of the judge’s years of experience in super- 
vision up to some maximum—say 20 years. If several expert 
judges are available the simpler procedure is to assume that all 
are equally competent, and accordingly by merely adding the 
scores one in a sense weights them equally.¹⁵ Thus in every 
case we obtain a weighted (summated) criterion score for each 
individual. 

c. A compromise set of bids is arrived at, possibly the 
rounded averages of the several judges’ bids on a given trait. 
No harm is done, and some gain in interpretation is secured, if 
these are so rounded that their sum also is 100. 

d. The compromise bids thus secured are the values of $\beta_1, 
\beta_2, \beta_3, \ldots, \beta_m$ of formula (1). 

15 Since they all summate to 100, this it would seem tends in part at least to 
weight the several judges equally. Of course the weights may add to a predetermined 
sum and yet the standard deviations of the bids may vary considerably, particularly 
in the case of erratic judges. Where scores are added that judge who has 
the largest standard deviation, of course, receives most weight. 
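
Steps a through d can be sketched for equally weighted judges. The largest-remainder rounding used to keep the total at 100 is an assumption, since the text asks only that the rounded averages also sum to 100; the function name and the bids are hypothetical.

```python
def compromise_bids(judges_bids, total=100):
    """Average each trait's bids over the judges, then round the averages
    so that the compromise bids still sum to `total`."""
    n_judges = len(judges_bids)
    n_traits = len(judges_bids[0])
    means = [sum(j[t] for j in judges_bids) / n_judges for t in range(n_traits)]
    scale = total / sum(means)
    scaled = [x * scale for x in means]
    bids = [int(x) for x in scaled]          # round down first
    shortfall = total - sum(bids)
    # Largest-remainder rounding (assumed): hand the leftover units to the
    # traits whose averages were cut down the most.
    order = sorted(range(n_traits), key=lambda t: scaled[t] - bids[t], reverse=True)
    for t in order[:shortfall]:
        bids[t] += 1
    return bids


# Three hypothetical judges bidding on three traits, each set summing to 100.
judges = [[50, 30, 20], [40, 40, 20], [45, 30, 25]]
print(compromise_bids(judges))
```

If the judges are to be weighted by assumed competence, as in step b, each judge's bids would be multiplied by his weight before the averages are taken.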

The judges should pay no attention to the signs of the $\beta$’s, 
as in the case of “accuracy” and “errors” above, but should 
judge the traits as-if-they-were-of-absolute-sign. 

Points to be noted in the ascription of the bids¹⁶ are: 

(1) Variables repeated in other variables (having a high 
correlation with other variables) should receive a low weight. 

(2) Variables which are subject to error, other things being 
equal, should receive lower weights than those not so subject 
to error. 

(3) Variables representing an adequate sampling over a 
long period of time should receive a high weight, other things 
being equal, relative to a variable which is a sampling of only 
a short performance of a similar trait. 

(4) The bids should be made up independently, i.e., with- 
out the judges conferring during the original distribution of 
bids. 

(5) After each of two or more judges independently has 
distributed 100 bids to the m traits, as much revision, before 
comparing results, as desired should be allowed. 

(6) After the bids are thus secured, a set of compromise 
bids should be made up. These likewise should add up to 100. 
In this construction of the compromise bid, the above first three 
principles, and all other pertinent ones, should be invoked in 
the discussion, to insure that a fair set of bids result. It may 
be found, for instance, that two ideals clash; the use of assis- 
tantships, etc., as rewards for the encouragement of good 
scholarship, and their use to obtain cheap labor for departmental 
routine duties, for example. It may be found necessary to 
make some distinction between these two; even to the extent 
of establishing two sets of weights, and evaluating each candidate’s 
merits from the two points of view, successively. It 
would seem to be good departmental policy, in that event, to 
evaluate for both appointments every candidate applying for 
either of the two types of appointments, irrespective of the fact 
that the candidate applied for only the one appointment. 
Otherwise, from the department’s point of view a capable assistant, 
for example, may be lost if there are not enough scholarships 
and fellowships “to go round”; while from the student’s 
point of view, often he will find the alternative position offered 
him to be very acceptable. 

16 Toops, Herbert A. “The Selection of Graduate Assistants.” The Personnel 
Journal, VI (1928), 457-472. 











PERSONALITY AND INTEREST FACTORS IN 
DENTAL SCHOOL SUCCESS 


CLAUDE EDWARD THOMPSON 


Northwestern University 


The purpose of this study was to determine whether or not 
certain criteria of success in Dental School are significantly 
related to scores on personality and interest scales. Previous 
research by the writer (15) on the relationships between motor 
and mechanical abilities and success in Dental School required 
that he interview personally over one hundred practicing den- 
tists and the faculty of a College of Dentistry. These men 
were almost unanimous in claiming that personality and inter- 
est factors are of as much importance as aptitudes in determin- 
ing success in dentistry. 

Published reports of research and reviews of the status of 
selection and counseling techniques in dental schools (1, 2, 4, 
5, 6, 7, 8, 12, 13, 14) indicate also that measures of personality 
and interest would be of value in selecting and counseling. The 
Strong Vocational Interest Blank for Men is being used as a 
predictor item for success in dentistry at the University of 
Maryland. However, no published reports are yet available 
to reveal relationships between scores on the interest scale and 
criteria of success in this school. 


Tests Used 
During 1942-43, three tests were administered to students 
in the College of Dentistry at Northwestern University.¹ These 
tests were: 

(1) Preference Record, Form AS, by G. Frederic Kuder 
(3, 9, 10, 17). 

(2) California Test of Personality, Adult Series, devised 
by Ernest W. Tiegs, Willis W. Clark, and Louis P. Thorpe 
(16). 

(3) MacQuarrie Test for Mechanical Ability, by T. W. 
MacQuarrie (11). 

The MacQuarrie Test for Mechanical Ability and the California 
Test of Personality were administered to 158 freshmen. The 
MacQuarrie, the California Test of Personality, and the Kuder 
Preference Record were administered to 66 seniors. 

1 The writer is indebted to Harold A. Graver, Director of Admissions of the 
College of Dentistry, Northwestern University, for assistance in the test administration. 


Criteria 


Previous studies (7, 8) at the College of Dentistry, North- 
western University, indicated that the MacQuarrie was a usable 
test for predicting freshman year average grade. Following a 
suggestion in those studies, an attempt was made in the present 
study to refine criteria for test evaluation by separating theo- 
retical and technique grades from practicum grades. This was 
done by obtaining cumulative points earned in technique and 
theory and cumulative points earned on product or work done. 
The scores were arranged in ascending order and divided into 
deciles. It was then possible to score anywhere from 1 to 10 in 
both kinds of grades. 

Results 


TABLE 1 

Correlations of Test and Criterion Scores for 158 Freshmen 

                                  Theory and Technique    Practicum 

MacQuarrie Total Score .........          .05                .11 
Self-Adjustment Total ..........         -.04               -.09 
Social-Adjustment Total ........          .08                .20* 
Total Adjustment Score .........      [illegible]            .20* 

* Correlations approaching statistical significance are indicated by asterisks in 
all tables. 

Table 1 presents the correlations obtained for the 158 freshmen 
between theory and technique and practicum criterion 
scores and scores on the two tests. The California Test of Personality 
gives Self-Adjustment, Social-Adjustment, and Total 
Adjustment scores. These were correlated separately. It 
should be pointed out that the California Personality Test (16) 
has not been validated against objective outside criteria. The 
180 items were evaluated in the following manner: 

(1) Judgments of teachers, principals, test experts, person- 
nel directors, and employers as to whether or not each item was 
an indicator of adjustment and employability. 

(2) The reactions of employed adults as to whether or not 
they judged each item to be an essential characteristic of a suc- 
cessful employee. 

(3) The extent to which the results of the test agreed with 
the known characteristics of particular adults. 

(4) The extent to which each item was consistent with the 
score on the test as a whole (bi-serial r). 

Due to the questionable reliabilities of the six components 
in each of the categories Self-Adjustment and Social-Adjust- 
ment, it was decided to use only the over-all scores for pur- 
poses of group comparison. The split-halves reliabilities of 
these totals are: 


Sec. 1. Self-Adjustment ........... .888 
Sec. 2. Social-Adjustment ......... .898 
        Total Adjustment .......... .918 


Only two of the correlations approach statistical signifi- 
cance. For the 158 freshmen there is no relationship between 
total score on the MacQuarrie and standings in the criteria. 
There is positive but low correlation between Practicum cri- 
terion scores and Social-Adjustment and Total Adjustment 
scores on the California Test of Personality. 
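
How far a correlation of a given size is from chance depends on the number of cases. As a rough modern check, and not the procedure the study itself reports, the usual t statistic for a Pearson correlation can be computed for N = 158. The function name is illustrative, and the two coefficients are taken from Table 1 as read here (.11 and .20).

```python
import math


def t_for_correlation(r: float, n: int) -> float:
    """t statistic for testing a Pearson r against zero (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


N_FRESHMEN = 158
for label, r in (("MacQuarrie total vs. Practicum", 0.11),
                 ("Social-Adjustment vs. Practicum", 0.20)):
    print(label, round(t_for_correlation(r, N_FRESHMEN), 2))
```

A coefficient of .20 on 158 cases stands well clear of zero by this test, yet it accounts for only about four per cent of the criterion variance, which agrees with the author's caution about individual prediction.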

Table 2 presents the correlations between criterion scores 
and test scores for 66 seniors. Only the three of nine compo- 
nents of the Kuder Preference Record on which seniors averaged 
definitely above the norms for this scale were correlated with 
the criterion scores. 

TABLE 2 

Correlations of Test and Criterion Scores for 66 Seniors 

                                  Theory and Technique    Practicum 

MacQuarrie Total Score .........          .17                .13 
Self-Adjustment Total ..........      [illegible]            .26* 
Social-Adjustment Total ........          .20*           [illegible] 
Total Adjustment Score .........          .22*               .26* 
Mechanical .....................         -.10               -.06 
Scientific .....................          .28*           [illegible] 
Social Service .................         -.01                .24* 

For both freshmen and seniors (Tables 1 and 2) the MacQuarrie 
total scores have little or no relationship to either Theory 
and Technique or Practicum criterion scores. It is possible 
that the first three sub-tests of the MacQuarrie measure 
motor skills and the last four sub-tests measure mechanical 
ability. Using total scores might, therefore, obscure relationships 
of sub-tests to criterion scores. Total scores were computed 
separately for the Tracing, Dotting, and Tapping tests 
and for the Copying, Location, Blocks, and Pursuit tests. 
These sub-total scores were then correlated with the criteria. 
These correlations are presented in Tables 3 and 4. 


TABLE 3 
Correlations of MacQuarrie Sub-Total Scores with Criteria for 158 Freshmen 

                                        Theory and Technique    Practicum 

Tracing, Tapping, Dotting ...........          -.17            [illegible]* 
Copying, Location, Blocks, Pursuit ..           .16            [illegible]* 




TABLE 4 
Correlations of MacQuarrie Sub-Total Scores with Criteria for 66 Seniors 

                                        Theory and Technique    Practicum 

Tracing, Tapping, Dotting ...........          -.19                .32* 
Copying, Location, Blocks, Pursuit ..      [illegible]*           -.27* 

It can be seen that consistently, for both freshmen and 
seniors, Tracing, Tapping, and Dotting scores correlate negatively 
with Theory and Technique and positively with Practicum 
scores, and Copying, Location, Blocks, and Pursuit correlate 
positively with Theory and Technique and negatively with 
Practicum scores. Five of these eight correlations approach 
statistical significance. If the first three sub-tests measure 
motor dexterity and the last four sub-tests measure perceptual 
or visualizing abilities, Theory and Technique scores could be 
expected to correlate with the last four sub-tests and Practicum 
scores could be expected to correlate with the first three sub-tests. 
This expectation receives support from the positive and 
negative sign directions of the obtained correlations, though 
such breakdown does not provide correlations of sufficient magnitude 
for use in individual prediction. 

All correlations (Table 2) between scores for seniors in 
Self-Adjustment, Social-Adjustment, 
and Total Adjustment scores on the California Test of 
Personality and Theory and Technique and Practicum criterion 
scores are positive but low. The correlations between Practicum 
criterion scores and Self-Adjustment, Social-Adjustment, 
and Total Adjustment scores approach statistical significance. 
The correlations between Self-Adjustment, Social-Adjustment, 
and Total Adjustment scores and Theory and Technique approach 
statistical significance. These findings indicate rather 
consistent relationships between what is measured by the test 
and standings in the criteria. However, these correlations 
would not be useful for individual predictions. 

Table 5 presents the average percentile standing of the 66 
seniors in the components of the Kuder Preference Record. 

TABLE 5 

Average Percentile Standings of 66 Seniors in Components of the 
Kuder Preference Record 

Components                     Average Percentile Obtained 

Mechanical ....................          91 
Computational .................          50 
Scientific ....................          93 
Persuasive ....................          30 
Artistic ......................          45 
Literary ......................          21 
Musical .......................          25 
Social Service ................          67 
Clerical ......................           8 

The average senior scores above average in Mechanical (91 
percentile), Scientific (93 percentile), and Social Service (67 
percentile) on the Kuder Preference Record. Mechanical inter- 
est scores do not correlate with either Theory and Technique 
or Practicum criterion scores (Table 2). Scientific interest 
scores correlate positively and significantly with Theory and 
Technique criterion scores but not with Practicum, and Social 
Service interest scores correlate positively and significantly with 
Practicum criterion scores but not with Theory and Technique. 
These findings indicate that interest patterns are related to the 
marks earned in Dental School, but the relationships appear to 
be more specific than the relationships between personality 
measures and marks earned. 

The failure of the scores in the three components of the 
Kuder Preference Record to correlate more highly with cri- 
terion scores could have been due to the narrow spread of the 
scores (see average percentiles, Table 5), particularly in Me- 
chanical and Scientific interest scores. It is well known that 
the coefficient of correlation is affected by the variability of 
scores in the group tested. To test this idea adequately it will 
be necessary to administer the Preference Record to a group of 
freshmen and a group of seniors in a follow-up study. 

The deviation of an individual’s interest scores from pattern 
(Table 5) may also be an index of the extent to which the scale 
places those freshmen and seniors standing low in Mechanical, 
Scientific, and Social Service interest scores low in criterion 
scores and places those scoring high in these same components 
high in criterion scores. 

To determine whether or not there were statistically signifi- 
cant differences between the mean scores of freshmen and 
seniors on the MacQuarrie and the California Test of Person- 
ality, critical ratios were obtained. These critical ratios indi- 
cate that there is no significant group difference between fresh- 
men and seniors on the MacQuarrie. There is a significant 
difference in favor of the seniors in Self-Adjustment scores on 
the California Test of Personality. There are 94/100 chances 
that a difference in favor of the seniors in Social-Adjustment 
scores is real. There are 99/100 chances that a difference in 
favor of the seniors in Total Adjustment scores is real. 


Summary 


(1) Seniors in dentistry score above average in Mechanical 
(91 percentile), Scientific (93 percentile), and Social Service 
(67 percentile) interest scores on the Kuder Preference Record. 
Mechanical interest scores do not correlate with either Theory 
and Technique or Practicum criterion scores. Scientific interest 
scores correlate positively with Theory and Technique criterion 
terion scores and Social Service interest scores correlate posi- 
tively with Practicum criterion scores. 

(2) Statistically significant differences in favor of the den- 
tistry seniors over dentistry freshmen in Self-Adjustment, 
Social-Adjustment, and Total Adjustment scores on the Cali- 
fornia Test of Personality were obtained. Whether or not these 
differences are due to selection, age, training, or all three is not 
known. 

(3) The California Test of Personality gave statistically 
significant positive correlations between Practicum criterion 
scores and Social-Adjustment and Total Adjustment scores for 
freshmen and between Practicum criterion scores and Self- 
Adjustment, Social-Adjustment, and Total Adjustment scores 
for seniors. Statistically significant positive correlations be- 
tween Theory and Technique criterion scores and Self-Adjust- 
ment, Social-Adjustment and Total Adjustment scores were 
obtained for seniors. 

(4) The results of this investigation indicate that correlat- 
ing MacQuarrie total scores with criteria may obscure relation- 
ships of sub-test scores to criteria. 

Personality and interest scale scores show some relationship 
in this study to criteria of success in Dental School, but the cor- 
relations are not of sufficient magnitude to be useful in indi- 
vidual prediction when selecting applicants for admission to 
the College of Dentistry. 


THE WORD-DEXTERITY TEST, A BETTER 
MEASURE OF COLLEGE 
APTITUDE 


SHAILER PETERSON 
The University of Chicago 


The purpose of this article is to describe a measuring instrument 
which, as an aptitude test, compares favorably with examinations 
commonly given during Freshman Week. This test 
has been tried at junior-high-school, senior-high-school and col- 
lege level and has proved itself a valuable predictor of school 
and course grades. 

The origin of this examination dates from work in remedial 
reading carried on at the University of Oregon in 1933, in coop- 
eration with the late Dr. DeBusk. Experimentation indicated 
that there was a distinct improvement in working vocabulary 
as soon as the student could see the carry-over in word meaning 
from one word to another. While this seemed to be particu- 
larly marked in the field of the natural sciences, other work has 
indicated that this ability aids students in other areas. The 
first instruments were not tests but instead were primarily 
teaching devices and intended for students who required reme- 
dial assistance to improve their school marks. New instruc- 
tional devices were prepared at Lebanon High School and at 
the University of Oregon High School. At this time, the author
described these first efforts in The English Journal.¹ Later at
the University of Minnesota, with the encouragement of Dr.
Alvin C. Eurich, Dr. Guy Bond, and Dr. Palmer O. Johnson,
more experimental work was conducted on this same project.
Still more recently at South Dakota State College there was
further opportunity for the author to observe the value of the
Word-Dexterity Test that had been developed.


1 Peterson, Shailer A. “Teaching the Special Vocabularies.” The English
Journal (College Edition), XXV (1934), 53-56.

The main objectives of the test in its present form are to
determine to what extent the student knows the meaning of 
certain suffixes and prefixes in common use and also to deter- 
mine if he can transplant the meaning of a suffix or prefix found 
in one word, whose meaning is known, to another word in which 
the same suffix or prefix is found. From an examination of the 
test items illustrated in this article, the reader can understand 
that while pure memory of words, suffixes, and prefixes will 
assist the student in securing a high score, still when the range 
of word difficulty is at the proper level, the test becomes essen- 
tially a problem for the student to demonstrate his “dexterity” 
at manipulating word parts and word meanings. Mechanically, 
this also becomes an “analogy” type test or one which tests for 
configuration and pattern. 

From an examination of the following directions and sample 
test items, the reader will be able to understand the construc- 
tion of the entire test. It is a power test rather than a speed 
test and the fifty-item test described here could be administered 
to senior-high-school or college students in 40 minutes. The 
words used in the examination were chosen after consideration 
of their Thorndike word count. These words, in addition to 
having a fairly extended range of difficulty, all employed suffixes 
and prefixes which in turn were to be found in at least four or 
five very common words as revealed by their Thorndike word 
count. 

Directions 


In this test, there are many scientific words that you already 
know. Of the others, you will find that in many cases you will be 
able to discover their meaning as you proceed. 

In the example, the word part SUB is considered. There are 
many words in our English language containing SUB. One of these 
words, SUBMARINE, has been printed on the line alongside. There 
is a space for two other words containing this same word part, SUB, 
and in this example, the two words, SUBWAY and SUBNORMAL,
have been written in.

Example: SUB submarine, subway, subnormal

Each of the three words, SUBMARINE, SUBWAY, and SUBNORMAL,
contains the same common part SUB. The first word
refers to an under-water boat; the second word refers to an
underground railway; while the last word refers to things or
conditions that are below or under normal. All three of these
words not only have the same word part, SUB, but they all have
the same meaning, “under.”

On another line beside each of the sections, there is a group of 
definitions, one of which is correct. You are to select the correct 
definition and place its letter in the space in the margin. As you will 
see in the following part of the example, definition “B” best describes
the word part, SUB.

Example: SUB (A) good quality, excellent; (B) beneath, below,
underneath; (C) mathematical, scientific; (D) slow, delayed;
(E) similar in appearance, alike, a duplicate.

While there is a space in each section for you to write only two
additional words, you may be able to think of more than two. In
some cases you may have difficulty in thinking of more than one
additional word. The more words that you can think of, the easier it
will be to decide what the true meaning is, for as in the case of the
words submarine, subway, and subnormal, the words are alike in two
respects. These words have the same word part, SUB, and also
have the same meaning, “under.”

You will find that this is true of other words also. Whether you 
can think of five new words or only one or two, try to decide on the 
best definition for the word part under consideration. 

DON’T WASTE TIME. DO FIRST THOSE ITEMS 
WHICH ARE EASIEST FOR YOU. 


Test Items 


3. MIS misunderstand
(A) wrong; (B) small, tiny, petty; (C) excuse, reason;
(D) denotes feminine gender; (E) poor, inferior.

4. CO cooperate
(A) with, together; (B) two; (C) without stopping;
(D) aid, help; (E) easy, ready.

5. IST chemist
(A) science, logical study, investigation; (B) schooling,
training, preparation; (C) forming agent nouns; (D) helping,
assisting, aiding; (E) including, possessing, having.

6. LOGY mineralogy
(A) weather, climate, temperature; (B) mining; (C) prediction,
estimation; (D) discourse, theory, doctrine; (E) tired, weary,
slow.

[Item numbers printed beside the nine items that follow: 10, 11, 32,
34, 39, 47, 48, 49; one number is missing from the source.]

GRAPH biography
(A) living things, life; (B) important, distinguished; (C) soil,
ground, rocks; (D) drawing, writing; (E) a study, a science.

ANTI antislavery
(A) treatment, medical aid; (B) old, historical, not used;
(C) in front of, before, preceding; (D) opposite, against,
instead of; (E) a study, a science.

SCRIB transcribe
(A) radio, phonograph; (B) circular, round; (C) write;
(D) duplicate; (E) voice, vocal.

POLY polytechnic
(A) arithmetic, numbers; (B) many, much, often; (C) few, not
many; (D) figures, diagrams; (E) college, school, study.

GEN genealogy
(A) science, study; (B) birth, born, descent; (C) related,
corresponding; (D) electrical; (E) Bible, Biblical.

PATH pathology
(A) scientific; (B) anger; (C) suffering, disease; (D) judgment,
estimation; (E) germs, bacteria.

DECI decimal
(A) arithmetic process, multiplication, etc.; (B) half; (C) ten,
ten times; (D) almost, nearly, quite; (E) accurate, scientific,
sound.

DUCT ductile
(A) save, preserve; (B) stretch, lengthen; (C) press, flatten,
squeeze; (D) able to be attracted, attractive; (E) lead, direct,
guide.

ANTE antedate
(A) outside, beyond; (B) behind, beside; (C) in front of,
preceding, before; (D) old, historical, not used; (E) not, never,
none.


In these items, the student is asked to write down a group
of other words each of which contains some of the same prefixes
and suffixes. This is to assist him in assigning word meaning
to the word parts in the unknown word. The test is scored,
however, on the basis of correct responses to the word’s meaning
and ordinarily no attention is paid to the particular “assist”
words that he writes down. If this examination were administered
in a situation where attention was to be given to remedial
work, then the character of these “assist” words would be
important.

Table 1 shows the product-moment correlation coefficients
between different variates in a college group of nearly three
hundred students. From this table, it is evident that the
Word-Dexterity Test predicts grade-point average, English grades,
mathematics grades, or chemistry grades better than most of
the other examinations with which it was compared. Grades
themselves in some subjects were better predictors of total
grade-point average, but this can be explained by the fact that
the individual course grades were themselves a part of the
grand average.

TABLE 1
Correlation Coefficients between Different Variates in College Group

At the high-school level, the product-moment correlation
between the Word-Dexterity Examination and I.Q. was +.58
and with mental age it was +.70. The point biserial correlation²
with years of science was +.20 and with years of foreign
language was +.39. The biserial correlation with science grades
was +.52.
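
The point biserial coefficient mentioned above can be illustrated with a short sketch. It is not the author's computation: it applies the standard formula r_pb = (M1 - M0) / s * sqrt(p * q), where s is the standard deviation of all scores and p and q are the proportions in the two categories. The function name `point_biserial` and the six-student data set are hypothetical and serve only to show the arithmetic.

```python
import math
import statistics


def point_biserial(groups, scores):
    """Point-biserial r between a 0/1 variate and a continuous score."""
    ones = [s for g, s in zip(groups, scores) if g == 1]
    zeros = [s for g, s in zip(groups, scores) if g == 0]
    if not ones or not zeros:
        raise ValueError("both categories must be represented")
    p = len(ones) / len(scores)      # proportion in category 1
    q = 1.0 - p                      # proportion in category 0
    sd = statistics.pstdev(scores)   # population SD of all scores
    mean_gap = statistics.mean(ones) - statistics.mean(zeros)
    return mean_gap / sd * math.sqrt(p * q)


# Hypothetical data: 1 = took the subject, 0 = did not; scores are invented.
groups = [0, 0, 0, 1, 1, 1]
scores = [20, 24, 28, 30, 34, 38]
r_pb = point_biserial(groups, scores)
print(f"point-biserial r = {r_pb:.3f}")
```

Computed this way, the coefficient is numerically the ordinary product-moment correlation between the 0/1 variate and the score, which is why it needs no assumption of normality in the dichotomized variate.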

While the scores for the 7th- and 8th-grade students are 
considerably lower than those for any of the other groups, it is 
interesting to observe that there is considerable overlapping of 
the scores for the upper grades. Figure 1 illustrates the percentile
rank curves for the seven grade levels.

The estimated reliability of the forty-minute Word-Dexterity
Test was +.90 to +.93, depending upon the group. These
computations were made by the Hoyt method³ and by Formula
20 of the Kuder-Richardson method,⁴ each of which gives
underestimates of the true reliability. In order to evaluate the
function of each of the items in the examination, the item-test


2 The point biserial r does not assume a normal distribution of the variate and 
is in these instances a better estimate than the biserial r that does assume a normal
distribution. 

3 Hoyt, Cyril. “The Reliability Estimated by Analysis of Variance.” Psycho- 
metrika, 4 (1941), 153-160. 

4 Richardson, M. W. and Kuder, G. F. “The Calculation of Test Reliability 
Coefficients Based on the Method of Rational Equivalence.” Journal of Educational 
Psychology, XXX (1939), 682-683. 
















correlation for each of the items was computed both on the
examination administered at the high-school level and on the
one administered at the college level. Table 2 provides a
frequency distribution of the fifty items according to the
correlations that they were found to have with the total test, a
measure of item validity when the total test is the criterion measure.
The high item-test correlations for all items in the examination
must be interpreted to mean that there is a commonality of
purpose and function for all items in the examination and that
whatever is being measured by the entire examination is being
contributed to materially by each item in the examination.

TABLE 2
Distribution of Item-Test Correlations Based upon Top and Bottom Quarters

Item-Test Correlation | Frequency of Items, High-School Group | Frequency of Items, College Group
+.8 to .9 | 3 | 1
.7 to .8 | 10 | 3
.6 to .7 | 16 | 12
.5 to .6 | 8 | 13
.4 to .5 | 10 | 6
.3 to .4 | 2 | 9
.2 to .3 | 0 | 5
.1 to .2 | 1 | 1
.0 to .1 | 0 | 0
Median Value of r | +.63 | +.53
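
The median values in Table 2 can be checked from the frequency columns. The sketch below assumes the usual linear interpolation within the class interval that contains the median; the article does not say which method was used, and the function name `grouped_median` is illustrative. The frequencies are transcribed from Table 2, listed from the lowest interval (.0 to .1) to the highest (.8 to .9).

```python
def grouped_median(lower_bounds, width, freqs):
    """Median of a grouped frequency distribution by linear interpolation."""
    half = sum(freqs) / 2
    cumulative = 0
    for lower, f in zip(lower_bounds, freqs):
        # The median lies in the first interval whose cumulative count reaches half.
        if f > 0 and cumulative + f >= half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


# Lower limits of the intervals .0-.1, .1-.2, ..., .8-.9.
lower_bounds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
high_school = [0, 1, 0, 2, 10, 8, 16, 10, 3]   # Table 2, high-school column
college = [0, 1, 5, 9, 6, 13, 12, 3, 1]        # Table 2, college column

hs_median = grouped_median(lower_bounds, 0.1, high_school)
college_median = grouped_median(lower_bounds, 0.1, college)
print(f"high school: {hs_median:.3f}, college: {college_median:.3f}")
```

Under this assumption the interpolated medians come to about .625 and .531, which agree with the tabled values of +.63 and +.53 after rounding, and each column sums to the fifty items of the test.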

Sample copies of the author’s Word-Dexterity Test are 
available directly from him and permission to duplicate this 
examination can be secured. 




















A STUDY OF THE KUDER PREFERENCE RECORD 


DANIEL J. BOLANOVICH AND CHARLES H. GOODMAN
Radio Corporation of America 


At the beginning of the war, one of the major problems that
faced industry was the need for highly trained technical
personnel. The Radio Corporation of America, like many other
companies faced with a shortage of engineers, planned to meet
this need with a short intensive training program for young
women. The program sponsored by RCA was a ten-month
course in the theory and practice of electronic engineering¹ and
was given at Purdue University.

Because of the importance of obtaining a maximum number 
of technical personnel who would be successful in the course, 
psychological tests were used as part of the selection process. 
The findings which were obtained as a result of the use of these 
psychological tests will be published at a later date. 

Three hundred applicants from various colleges of thirty-six 
states and from RCA’s six manufacturing plants applied for the 
course. Of this number eighty-six were finally selected. The 
girls selected were called RCA Cadettes. Their ages ranged 
from eighteen to twenty-nine years, with a median age of nine- 
teen years. Five Cadettes had had no previous college experi- 
ence, sixty-five of the group had either one or two years of col- 
lege work when they entered the course, and one Cadette had 
had graduate work. 

The positions which the Cadettes were expected to fill were 
varied and included such activities as writing instruction man- 
uals, design, drafting, statistical analysis, laboratory assembly, 
and equipment testing. Because of the variety of jobs open to
the Cadettes it was felt that it would be helpful in placing them
if their interests could be determined. After the Cadettes had
been at Purdue University one month, the Kuder Preference
Record was administered to them. The reasons for using the
Kuder were: (1) it offered interest areas which appeared
related to the jobs that were to be filled; (2) it offered
possibilities of determining the preferences of the Cadettes for
placement; (3) it offered an opportunity to examine the Kuder as a
possible selection device for future training programs; and (4)
it was easy to score.

1 The courses given in the training program were: Mathematics, Shop, Drawing,
A-c. D-c., Radio Manufacturing, Communications, Radio, Measurements, Electronics.

Fig. 1. Median raw scores of the Cadettes converted into percentile equivalents
on the Kuder norms.

The results reported in this paper are based upon the scores
obtained from the sixty-six Cadettes for whom there were complete
data. Upon completion of the scoring of the Preference
Record, an individual profile was constructed for each Cadette.
In order to obtain a graphic profile that would be representative
of the Cadettes as a group, the median scores were calculated
for the Cadettes on each category of the Preference
Record.

Figure 1 shows the profile constructed by plotting these
median scores after converting them into the percentile equivalents
given by Kuder in his norms for college freshmen.² It is
interesting to note from Figure 1 that the Cadettes as a group 
stand at the 84th percentile in the Mechanical preference area 
and at the 92nd percentile in the Scientific preference area. 
There is, of course, the possibility that the Cadettes’ responses 
may have been influenced by their desire to appear highly inter- 
ested in scientific and mechanical pursuits because of the nature 
of the course they were taking. It should be remembered, 
however, that the Kuder Preference Record played no part in 
the selection process of the Cadettes. The Cadettes appear to 
be similar to the norm population in Computational, Artistic, 
and Social Service preferences. They are, however, consider- 
ably below the median of the norm population in Persuasive, 
Clerical, and Musical preferences. 

The graphs in Figure 2 show the percentage distributions 
of the Cadettes’ scores on each of the preference categories. 
The norm graph shown was constructed in order to facilitate a 
visual comparison of the percentage distributions found for the 
Cadettes on each of the preference categories. The base lines 
of each graph have been divided at interval points correspond- 
ing to the 10th, 25th, 50th, 75th, and 90th percentiles for the 
norm group. Each graph can then be compared to the norm 
graph. 

Figure 2 appears to support the evidence of Figure 1 that the
group selected was particularly interested in the Scientific and
Mechanical categories, a finding similar to that obtained by
Goodman³ for engineering students. There are, however, a few
Cadettes below the norm medians on these keys. The distribution
of the percentages on the Clerical graph is the reverse
of those on the Mechanical and Scientific categories. Forty-four
per cent of the Cadettes fall below the 10th percentile and
only four per cent are above the 75th percentile. The majority
of the Cadettes fall below the 25th percentile in Persuasive
preference, although there are five per cent above the 90th
percentile who might fit into jobs where considerable contact with
others is necessary, such as quality control inspectors. The
Cadettes tend to fall in the middle ranges in Computational
preferences.

2 Norms for women only were not available at the time this study was made.
Use of norms for high-school girls, which have since been published, would not change
the conclusions reached in this paper.

3 Cf. Goodman, C. H. “A Comparison of the Interests and Personality Traits
of Engineers and Liberal Arts Students.” The Journal of Applied Psychology, XXVI
(1942), 721-737.

These graphs were useful in helping to determine final placement
of the Cadettes. The Cadettes in the extreme ten per
cent groups of each Preference category were given particular
attention for possible placement in jobs related to their
preferences. For example, some of the jobs to be filled involved
mechanical work, scientific work, clerical work, and literary
work. The histograms in Figure 2 were also used in counseling
the Cadettes and aiding them in the interpretation of their own
individual profiles.

To evaluate the possibilities of the Preference Record as a
selection tool for similar programs, correlations of the Cadettes’
Preference scores with their final grade averages for the entire
course were obtained. It was found that first- and second-semester
grade averages of the Cadettes correlated .81, which
would indicate fairly close agreement between the grades of the
first and second semesters. As a result, total grade averages
were taken as the criterion. Table 1 gives the frequency
distribution of grade averages, and shows a range from 3.0 to 5.9.
The grades for each subject were given in whole numbers ranging
from 2 to 6. Grade averages were obtained by multiplying
each subject grade by the number of credit hours and dividing
the total for all subjects by the total number of hours for the
course.

TABLE 1
Distribution of Cadettes’ Total Grade Averages

Total Grade Averages | Number of Cadettes
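
The credit-hour weighting described above amounts to a weighted mean. The following is a minimal sketch of that arithmetic; the function name `weighted_grade_average` and the sample record are hypothetical, since the article does not list the credit hours assigned to the individual subjects.

```python
def weighted_grade_average(grades_and_hours):
    """Credit-hour weighted mean of (grade, credit_hours) pairs."""
    total_hours = sum(hours for _, hours in grades_and_hours)
    if total_hours == 0:
        raise ValueError("no credit hours recorded")
    weighted_sum = sum(grade * hours for grade, hours in grades_and_hours)
    return weighted_sum / total_hours


# Hypothetical record: whole-number grades between 2 and 6, invented credit hours.
record = [(5, 4), (4, 3), (6, 2), (3, 1)]
average = weighted_grade_average(record)
print(f"total grade average = {average:.1f}")
```

For this invented record the weighted sum is 47 over 10 credit hours, giving an average of 4.7, which falls inside the reported range of 3.0 to 5.9.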

Table 2 shows the correlations of grade averages with the 
various Preference scores. Since none of the correlations attain 
statistical significance* (even at the five per cent level) the 
Preference scores cannot be considered as predictive of success 
in the Cadette training course. 
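
The absence of significance can be checked roughly for the highest correlation reported in the Summary (.18) with the 66 Cadettes for whom data were complete. The sketch below is not the procedure of the cited reference; it uses the standard conversion t = r * sqrt(N - 2) / sqrt(1 - r^2). The critical value of 2.00 is an assumption taken from ordinary two-tailed t tables for about 64 degrees of freedom, and the function names are illustrative.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a product-moment r against zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def critical_r(t_crit, n):
    """Smallest |r| that reaches a given critical t with n - 2 df."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)


N_CADETTES = 66       # Cadettes with complete data
HIGHEST_R = 0.18      # highest correlation reported in the Summary
T_CRIT_5PCT = 2.00    # assumed two-tailed 5 per cent value for about 64 df

t_value = t_for_correlation(HIGHEST_R, N_CADETTES)
r_needed = critical_r(T_CRIT_5PCT, N_CADETTES)
is_significant = abs(t_value) > T_CRIT_5PCT
print(f"t = {t_value:.2f}, r needed = {r_needed:.2f}, significant: {is_significant}")
```

On these assumptions a correlation of about .24 would be needed with 66 cases, so a coefficient of .18 falls short of the five per cent level, in line with the statement above.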

* Lindquist, E. F. Statistical Analysis in Educational Research. New York:
Houghton Mifflin Co., 1940. P. 212.

For further clarification of the relationship between Preference
scores and success in the training course, an analysis was
made of the scores of three sub-groups: (1) the most successful
students, (2) the least successful students, and (3) those
who terminated their training before the course was completed.
Figure 3 shows graphically the median scores, converted into
their percentile equivalents on the Kuder norms, of 16 Cadettes
with grade averages of 4.9 or above, 13 Cadettes with grade
averages of 3.8 and below, and the 8 Cadettes who dropped out
of the course for reasons other than illness.
For groups as small as these, the proximity of the curves for the
most successful and least successful Cadettes who completed
the course appears pronounced. Figure 3 seems to bear out the
low correlations of Table 2. The largest differences between
the most successful and the least successful Cadettes appear to
be on the Computational, Artistic, Musical, and Social Service
categories.

TABLE 2
Correlations of the Kuder Preference Record Scores with Total Grade Averages

The profile of the terminated Cadettes appears to deviate
from the patterns of the groups who completed the course. The
terminated Cadettes have lower median percentile scores in
Mechanical, Computational, and Scientific preferences, and
higher median percentile scores in the Persuasive, Literary, and
Social Service categories.

Fig. 3. Comparison of Median Raw Scores for the Most Successful, Least
Successful and Terminated Cadettes Converted into Percentile Equivalents on the
Kuder Norms.

In order to determine the significance of the differences
shown in Figure 3, t ratios were computed using Student’s
technique for small sample distributions. Table 3 shows the median
raw scores for the three sub-groups on each of the Kuder
preference scales. The t ratios obtained for each group when
compared with the other two groups, and the probability values for
these t ratios, are shown in Table 4.
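
A t ratio of the kind described above can be sketched as follows. The article gives neither the individual scores nor the exact formula, so this example assumes the pooled-variance form of Student's t for two independent small samples; the function name `pooled_t` and both score lists are hypothetical.

```python
import math
import statistics


def pooled_t(a, b):
    """Student's t for two independent samples with pooled variance.

    Returns the t ratio and its degrees of freedom.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    std_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_error, df


# Hypothetical preference scores for two small sub-groups (invented values).
group_a = [70, 74, 78, 82, 86]
group_b = [60, 66, 72, 78, 84]
t_ratio, df = pooled_t(group_a, group_b)
print(f"t = {t_ratio:.3f} with {df} degrees of freedom")
```

The resulting ratio is then referred to a t table at the stated degrees of freedom to obtain the probability value; with sub-groups as small as those in the study, sizeable differences in central tendency can still fail to reach the five per cent level.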

TABLE 3
Median Raw Scores of the Most Successful, Least Successful, and Terminated
Cadettes on the Kuder Preference Record

Terminated (N = 8) | 59.5 | 24.5 | 55.5 | 74.5 | 51.5 | 60.5 | 22.0 | 81.0 | 40.5

Accepting Fisher’s criterion of a 5% probability as indicative
of a significant difference, it will be found that there are
only five such P values in Table 4. The five instances of
significant differences are:

Computational: Most successful Cadettes score significantly
higher than the least successful Cadettes.

Computational: Most successful Cadettes score significantly
higher than the terminated Cadettes.

Scientific: Most successful Cadettes score significantly
higher than terminated Cadettes.

Persuasive: Terminated Cadettes score significantly
higher than most successful Cadettes.

Persuasive: Terminated Cadettes score significantly
higher than least successful Cadettes.

The results indicate that the most successful Cadettes are
differentiated from the least successful Cadettes on the
Computational key only. The most successful Cadettes are
differentiated from the terminated Cadettes on the Computational and
Scientific keys. The terminated group is differentiated from
the most successful and least successful Cadettes by higher
scores on the Persuasive scale. On the whole the group of
terminated Cadettes appears to approximate more nearly the
pattern of preferences of Kuder’s college freshmen.

The above differences provide the only evidence that may 
indicate any possible utility of the Kuder Preference Record as 
a selection device for future applicants in a training course such 
as the one described. 

Summary 


This study has presented data based upon the Kuder Preference
Record scores of 66 Cadettes enrolled in a training program
for electrical engineering aides. Investigation was made
of the possibilities of the Kuder Preference Record as a selection
device and its usefulness as a counseling and placement tool.
The findings show that the Cadettes, as a group, differed from
the college freshmen norm group, having particularly strong
preferences in Mechanical and Scientific pursuits. There was
a pronounced lack of preferences in Persuasive, Clerical, and
Musical endeavors. Score distributions showed, however, wide
differences among the Cadettes on each scale. A few Cadettes
showed a lack of preferences in Scientific and Mechanical
fields, while a few ranked very high in Clerical, Computational,
Literary, and Persuasive preferences. These distributions,
together with the individual profiles, provided some indications
for counseling and placing the Cadettes.

The scales of the Preference Record showed low correlations
with final grade averages for the course. The highest correlation
was .18 for the Computational key with final grade average.

The profiles of the most successful Cadettes and the least
successful were similar. The group profile of the terminated
Cadettes approached that of the average college freshmen.
These terminated Cadettes made significantly lower scores on
the Scientific and Computational preferences than the most
successful Cadettes. The terminated Cadettes made significantly
higher scores than the most successful and least successful
Cadettes on the Persuasive scale. The most successful
Cadettes made significantly higher scores than the least
successful Cadettes on the Computational key.

The following conclusions appear to be justified: 

(1) On the basis of the correlations with total grade aver- 
ages, the Kuder Preference Record does not appear to be a 
promising selection device for predicting course achievement of 
female engineering Cadettes. It does, however, show some 
promise as a device for eliminating those who would be likely 
to drop out before completion of the course. 

(2) The Kuder scores afford some indications that can be 
helpful in counseling and placement, especially in a situation 
where there is a variety of job openings. 





























DEVELOPING A SERVICE RATING SYSTEM 
IRENEUS S. SMITH 


Civil Service Commission of San Francisco 


The “San Francisco System” of service rating was put into
effect by the Civil Service Commission of San Francisco on 
July 1, 1944, for probationary employees, and shortly will be 
applied to nearly 15,000 employees of the County and City of 
San Francisco. Before the system was developed and adopted, 
a comprehensive survey was made of various systems in use. 
This paper is concerned with a presentation of the findings of 
our survey and a description of the approach finally taken in 
developing the “San Francisco System.” 

Our research took many months, covering most of the major 
federal, state, and municipal personnel jurisdictions throughout 
the United States and Canada. We received the finest sort of 
co-operation almost uniformly, and in our analysis of the differ- 
ent systems in use, criticism is not intended. We analyzed each 
in relation to the specific problems facing us in developing a 
rating system that would be of the most value for our own local 
jurisdiction. 

The problem of rating employee performance is still in the 
experimental stage and, up to comparatively recent times, rat- 
ings were considered more or less unreliable. Personnel agen- 
cies have lately devoted a very large amount of research to the 
problem, while leading personnel writers and publications have 
exhaustively gone into the field. 

From San Francisco’s viewpoint, the purpose of inaugurat- 
ing a service rating system was to develop a means of attracting 
more capable workers into public service as a career, raising the 
caliber of city personnel, in line with the merit system trend 
to further eliminate the “spoils system” from public service. 
Such a service rating system was authorized by our city charter.

Our national survey showed that service rating has proved
of real value in raising civil service standards. It is desired 
generally by competent employees since recognition may be 
given to faithfulness and merit as factors in promotional exam- 
inations as well as in eligibility for salary increases, transfers, 
leaves of absence, and other civil service privileges. 

From the appointing officer’s viewpoint, service rating pro- 
vides an impartial yardstick by which he can measure employee 
performance, and also provides a definite incentive by which 
the majority of employees seek improvement, to that extent 
bettering the performance of his department and at the same 
time simplifying his personnel problem. From this definite 
national consensus, we felt assured in following through with 
development and inauguration of our own system. 

We also found that regardless of the form of the rating 
device there are, in the opinion of administrators generally, 
certain requisites to the success of any employee rating plan: 

First: The supervisors or executives who are asked to rate 
their subordinates must be sold on the idea. This result may 
be brought about by a series of conferences with such super- 
visors conducted by officials of the personnel department. 

Second: The employees themselves should be sold on the 
idea. When merit ratings are properly sold, investigators re- 
port that a large percentage of employees like to be rated and 
will give complete co-operation to any rating plan that they 
consider to be fair. 

Third: The supervisors must be trained to rate those under 
them. One of the principal weaknesses of rating systems ap- 
pears to be the lack of uniformity of ratings between different 
departments or bureaus. In one department, for example, a 
supervisor may set very low standards of performance and rate 
85% of his employees “excellent” and the remaining 15% “very 
good.” In another department the supervisor may set a very
much higher standard of performance and rate 85% of his employees
“good” and distribute the other 15% anywhere along the
scale from excellent to unsatisfactory. In attempting to solve
the problem of maintaining reasonable uniformity of service
rating standards some jurisdictions (notably the United States
Housing Authority, the Tennessee Valley Authority, and the
New York State service) have adopted a plan based on the
idea that the categories of “excellent” and “unsatisfactory”
should be reserved for those employees who distinguish themselves
from their fellow workers unmistakably, to either the
advantage or the disadvantage of the service; and when such
ratings are given, the supervisor is required to cite on the back
of the rating form, or on a special form, concrete evidence of
such service.

Fourth: The rater should take the ratee into his confidence, 
show him his ratings, and discuss them with him. When the 
employee knows what traits or qualities are judged to be un- 
satisfactory, his normal reaction would be to seek self-improve- 
ment. Especially will this be the case when ratings are used 
as a factor in promotional examinations; and also in determin- 
ing the order of layoff and reemployment, and eligibility for 
salary increases, transfers, leaves of absence, and other privi- 
leges. How the employee does his work must be determined 
in most cases by the employee’s immediate supervisors. One 
of the problems therefore is to secure reliable ratings from vari- 
ous supervisory officers. In this respect there are two schools 
of thought. One maintains that the actual assignment of a 
rating should be made by the supervisors who are directly in 
charge of the employees. The other and more recent school 
confines the supervisor to the reporting of significant behaviors 
or activities, relying for the actual valuation upon some me- 
chanical scoring system administered by the central personnel 
staff. 

In devising a rating system the two principal technical 
problems have to do with the selection of factors to be rated 
and the comparative evaluation of such factors. Authorities 
generally are agreed that the objectivity and reliability of rat- 
ings increase as the factors involved become more specific. The 
chief fault to be found with rating plans in the past is that such 
plans provided for over-all ratings by the supervisor. Present 
practice is opposed to reliance upon such highly subjective esti- 
mates. It prefers, overwhelmingly, ratings which are objec- 
tive. The maximum degree of objectivity is approached if the 
attention of the supervisor is focused on specific job behaviors 
or activities. In addition, each item to be reported or evalu- 
ated should be clearly described in simple, unambiguous terms 
and as concretely as possible. 

The point should be stressed, however, that no matter how 
carefully factors have been selected and no matter how care- 
fully supervisors observe and report the behaviors of their 
employees, the whole system will break down unless the actual 
rating based on these is sound. Almost without exception, in 
the opinion of Mosher and Kingsley, the devices that have been 
developed to evaluate employee services have failed to provide 
an adequate measuring instrument. 

One method of rating the efficiency of employees, especially 
in industry, is through the use of production records. Such 
positions as stenographer, typist, file clerk, machine operator, 
or copyist lend themselves to unit measurement. In some of 
the larger federal establishments, such as the Farm Credit Asso- 
ciation and the Federal Reserve Bank of New York, the num- 
ber of pages typed, the number of errors, and the general 
appearance of the work are taken into consideration by super- 
visors when typists are rated. 

Another method is through the use of periodic tests. This 
method has been extensively employed in some of the federal 
bureaus for rating the efficiency of their employees. In the 
Department of Justice, for example, the service ratings of 
stenographers and typists depend in part upon their standing 
in periodic speed tests, while performance on a periodic sorting 
test is one basis for determining the rating of postal clerks. 
This method of rating, however, is adapted only to routine and 
repetitive jobs. 

The method in most general use is that of rating schedules. 
A description of some of the schedules in use in other jurisdictions 
follows: 

1. The Graphic Rating Scale. This, according to many 
personnel experts, is the most popular rating method in use, 
being widely employed in private concerns as well as by a num- 
ber of public personnel administrations. It consists essentially 
of two elements: (1) a list of traits or activities arrived at by 
an analysis of factors leading to success or making for failure 
on the job; (2) various descriptive phrases or adjectives denot- 
ing the several degrees of a particular activity or trait. The 
form of the device is as follows: 


|    |    |    |    |    |    |    |    |    |    |

Knowledge    Thoroughly     Well           Adequate       Limited        Inadequate
of work      familiar       informed,      knowledge,     knowledge      comprehension
             with all       has mastered   knows job      of job         of work
             phases of      most           fairly
             work           details        well





The descriptive phrases serve as a guide to the rater who is 
instructed to place a check mark along the line in one of the 
10 boxes above the phrase or between the phrases at the point 
which, in the rater’s opinion, represents the degree of the par- 
ticular quality possessed by the ratee. 

This method is open to the objection that there is no stand- 
ard unit of measurement involved and the device is not really 
a scale at all. It is a schedule the various items of which are 
arbitrarily weighted and given a numerical value through the 
use of a scoring stencil. The device is scored as a rule by the 
application of a ten-point stencil to the straight line. 
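
The stencil scoring just described can be pictured with a short sketch. The Python below is illustrative only: the function name, the convention of expressing the check mark's position as a fraction of the line's length, and the assumption that the favorable end of the line lies at the left (as in the sample form) are ours and are not taken from any published scoring key.

```python
def stencil_score(position: float, points: int = 10) -> int:
    """Convert a check mark's position on the rating line to a stencil score.

    position: 0.0 is the left (most favorable) end of the line, 1.0 the right end.
    points:   number of equal divisions on the stencil (ten by default).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0.0 and 1.0")
    # Find which of the equal divisions the mark falls in (0 = leftmost).
    division = min(int(position * points), points - 1)
    # The leftmost division earns the highest score.
    return points - division


if __name__ == "__main__":
    for mark in (0.05, 0.5, 0.97):
        print(mark, stencil_score(mark))
```

The sketch makes the objection concrete: the score depends entirely on where the stencil's divisions are drawn, not on any standard unit of measurement.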

2. The Probst Service Rating System. This system, devel- 
oped by J. B. Probst, the Chief Examiner of the St. Paul Civil 
Service Commission, in 1928 was in the opinion of personnel 
administrators the only system which up to that time showed 
any real marks of merit. It has been tried out in a number of 
civil service jurisdictions including the Cincinnati, Detroit, Los 
Angeles, and California State public services. Mr. Probst dis- 
carded the scale idea previously mentioned and substituted for 
it a list of about 100 characteristics or modes of behavior. The 
rater checks only those items which are known to describe the 
ratee; the rest he leaves blank. In addition, he is not required 
to measure relative degrees of a quality in the ratee. Accord- 
ing to its author the scheme is so designed that failure on the 
part of the scoring official to correctly and conscientiously check 
the employee’s traits can be ascertained at a glance. It has 
the added advantage that elaborate instructions to reporting 
officers are unnecessary. 
















The actual evaluation of employee service is made by the 
use of a mechanical scoring system administered by the central 
personnel office. This scoring system has been one of the prin- 
cipal bases of criticism of the plan because of the difficulty in 
understanding it. 

3. California Report of Performance Plan. In 1938 the 
California State Personnel Board abandoned the Probst Sys- 
tem then in use for one which, in the opinion of the Board, more 
satisfactorily and adequately measured the performance of 
State employees. A separate report sheet for different types 
of work containing a particular combination of work character- 
istics was developed. At the present time 45 different report 
forms each containing a separate list of factors to be rated are 
employed. There are five possible gradations of markings for 
each item, represented by five columns. These columns are 
defined in terms of the extent to which an item is characteristic 
of the work of an employee. The individual in highest author- 
ity who is in intimate contact with the work of the employee 
acts as the “reporting” officer. After the reporting officer has 
prepared the report he must review it and discuss it with the 
employee, who is also given a copy of the report. The original 
is sent to the personnel agency for scoring. An appeal board 
has likewise been set up to hear and decide cases in which the 
supervisor and the employee disagree on the markings of a par- 
ticular report. 

4. The Los Angeles City Schools recently adopted a per- 
formance report system of the trait-rating type where four 
separate forms are used. A unique feature of this plan is that 
it was developed by representatives of employee groups and 
organizations. These groups studied and discussed the prob- 
lem for about a year before submitting the final plan to the 
personnel board. The forms are so compiled that they may be 
machine scored. These two features, and especially the first, 
might well be given favorable consideration if and when a rating 
system for all employees is put into effect. Employee support 
would thus be assured, without which, it is claimed, no plan can 
succeed. 

5. The San Francisco public schools are at present using a 
rating form for non-certified personnel. It consists of a three-step 
rating schedule in which three degrees of each trait, corresponding 
to excellent, good, and poor, are identified by appropriate 
descriptive phrases arranged in three columns. Such a performance 
report is open to the objection that the ratings are 
too broad or, in other words, that they do not give a sufficiently 
fine gradation of the relative excellence of the employees. 

6. The Home Owners Loan Corporation utilizes a service 
rating form that consists of three traits which are rated directly 
by the rating officer on the basis of excellent, very good, good, 
fair, or unsatisfactory. This method is open to objections, 
among them these: the traits to be rated are too few to yield 
a comprehensive picture of the employee’s ability; also, the 
traits are compound ones and require an over-all rating of more 
than one type of activity. 

7. The City of St. Louis, Department of Personnel, very 
recently adopted a service rating plan according to which em- 
ployees are rated on eight specific traits and an over-all evalu- 
ation of work performance. These nine measures are felt to 
include all factors necessary to arrive at a comprehensive evalu- 
ation of employee performance and to apply to all types of posi- 
tions. This plan, like that of the H.O.L.C., is open to the criti- 
cism that the ratings are too subjective since the traits are 
rated by two different supervisors who place a check mark 
in one of five columns headed excellent, very good, good, fair, 
and poor. The subjectivity of the ratings is lessened to some 
extent, however, by a provision whereby the supervisor is re- 
quired, when a rating of either excellent or unsatisfactory on 
the over-all evaluation is given, to furnish substantiating evi- 
dence of the employee’s superior or inferior performance on the 
job. 

8. The Detroit municipal service rating report consists of 
a three-step rating schedule in which 25 traits are rated 
objectively as poor, satisfactory, or above average, with 
the aid of descriptive phrases which identify the various degrees 
of each trait. Ten additional traits are rated where the employee 
is in supervisory charge of other employees’ work. The plan appears 
to have considerable merit. The number of traits rated objectively 
is sufficiently large to give a comprehensive and reliable 
picture of the employee’s worth. The one criticism that might 
be leveled at the method of rating is that only three degrees of 
each trait are rated in place of the conventional five-step rating 
schedule. 

9. The Minnesota State report of employee performance is 
an adaptation of the California State employee performance 
form. It consists of a large number of traits (about 60 in all) 
which may be measured objectively in terms of the frequency 
with which these traits can be observed. The traits are checked 
in one of five columns. Different forms are used to rate differ- 
ent groups of employees. Complete instructions for use of the 
rating form are contained on the reverse side of the form. The 
large number of traits to be rated makes the scale rather cum- 
bersome; it has the merit, however, of giving a very compre- 
hensive picture of employee worth. 

10. Saginaw, Michigan, uses an employee service report of 
the graphic rating scale type. Eight traits are rated objec- 
tively with the aid of phrases descriptive of degrees of the trait 
placed under a horizontal line. The rater is required to place 
a check mark in one of nine divisions into which the horizontal 
line is divided. On the reverse side of the form are four addi- 
tional traits which apply only to certain jobs and also an over- 
all rating or general report, as it is called, of the employee’s 
worth in the opinion of the rater. The form has the merit of 
extreme simplicity and allows the rater wide latitude in his 
appraisal of the degree of each trait possessed by the ratee. 

11. A distinct departure from the conventional rating 
schedule in general use is the “Employee Guidance Sheet,” as 
it is called, recently developed by the Alabama State Personnel 
Department. The horizontal scale has been discarded in favor 
of a check-list arrangement. The five descriptive phrases 
identifying the various degrees of the ten traits to be rated are 
couched in language intended to help and encourage the em- 
ployee rather than to report findings in a coldly impersonal and 
sometimes quite blunt manner. The following sample will illus- 
trate the effort that was made to humanize the report and 
stimulate the employee to greater effort: 

Usual Form 
Quantity of work: 
( ) Unusually high output. 
( ) High output. 
( ) Normal output. 
( ) Limited output. 
( ) Insufficient output; unsatisfactory. 


Alabama State Form 
Quantity of work: 
Just a friendly suggestion: 
( ) Exceptionally high output. Keep it up. 
( ) Better than average. Good going. 
( ) Meeting our requirements. 
( ) You could do more. Try harder. 
( ) You could do a lot more. Try much harder. 


Other features include a statement in which the reporting officer 
indicates whether the duties of the position have changed since 
the last rating period and also a scale on which the department 
head indicates his opinion of the supervisor’s ability as a rater. 
All in all, this rating form seemed by far the best of any studied 
by us. It combined all the features of an exceptionally well 
developed rating system, including simplicity, ease in scoring 
and rating, and high objectivity and reliability. In addition, 
the language employed in the rating scale was so chosen that 
the employee would feel that the criticism was offered in a help- 
ful mood and in consequence should have a strong desire to 
improve his performance if it were below normal. 

The foregoing discussion is devoted to a review on a very 
limited scale of representative rating plans in use. None of 
these plans are restricted to the rating of the probationary 
period alone. In inaugurating the San Francisco employee 
rating plan, therefore, it was felt desirable, in order to acquaint 
employees and employers with the idea gradually, to limit it to 
probationers first. The San Francisco charter provides that “at 
any time during the probationary period the appointing officer 
may terminate the appointment.” Since power of dismissal is 
vested in the appointing officer and since his decision is final, 
we felt that simplicity should be the determining characteristic 
of any rating plan for such employees. Keeping in mind the 
objectives that should be inherent in every good rating device, 
we attempted to develop a plan which included the best features 
of those reviewed, with particular emphasis on the following 
points: 

1. The development of a uniform report sheet containing 
traits which will fit all types of work. 

2. Clear and concise wording of the descriptive phrases 
which measure the degree of each trait. 

3. Phrasing of steps so as to compel supervisors to give care- 
ful consideration to their markings. 

4. A sound and easily applied scoring formula. 

The San Francisco rating plan utilizes a simple check form 
applying uniformly to all similar classes of positions in the vari- 
ous city departments, with the exception of the Police and Fire 
Departments, for which separate forms have been prepared. 
The form is filled out by supervisors and double checked by 
department heads or appointing officers. 

The traits included in the form which apply to all employees 
are as follows: Promptness, Attendance, Quality of Work, 
Ability to Learn, Co-operation, Dependability, Judgment and 
Initiative. Additional traits which apply only to certain jobs 
are: Volume of Work, Contacts with Public, Physical Fitness, 
Appearance, and Initiative. In the case of supervisory posi- 
tions, Ability to Train and Organization of Work are included. 
The five phrases descriptive of the degree of each trait are 
arranged in the form of a check list, with the phrase denoting 
excellence placed sometimes at the beginning and sometimes at 
the end. Such an arrangement increases the probability that the 
rater will pay full attention to the descriptive phrases. 

On the back of the report is space for the certificate of the 
reporting officer and of the employee involved. The appointing 
officer in cases involving probationary employees is required to 
indicate his impression of the employee’s General Fitness to 
hold the position, and to indicate presence or absence of unde- 
sirable characteristics which would make the employee unsuited 
to the particular job. If the probationary employee is rejected, 
the appointing officer must indicate specifically why. Ratings 
are given to probationary employees at the end of the second and 
fifth months of service, and give appointing officers legitimate 
grounds for rejecting an unsatisfactory applicant by refusing 
to certify him for permanent tenure. 

The probationary period is a definite part of the examination 
and tests those factors for which no adequate written or 
oral tests have been devised. Unfortunately, however, appoint- 
ing officers often fail to recognize their responsibility in checking 
the probationer’s services and too often the probationary period 
elapses without adequate investigation of the employee’s per- 
formance on the job. It is hoped that the use of the service 
rating system for probationers will serve to impress upon the 
appointing officer his responsibility in determining whether or 
not the probationary appointee will be a satisfactory permanent 
employee. 

We expect to rate permanent employees once or twice per year. 


















NEW DEVELOPMENT FOR FIRE MOTOR DRIVER 
EXAMINATION 


WILLIAM E. TRUOG, JR. 
Kansas City Personnel Department, Kansas City, Missouri 


The development of a new performance testing procedure 
for Fire Motor Drivers followed some extensive research by the 
Personnel Department of the City merit system of Kansas City, 
Missouri. This examination is based upon a standardized test 
used by the United States Army Engineers to examine all motor 
vehicle drivers, and was developed by Amos E. Neyhart, Ad- 
ministrative Head, Institute of Public Safety of Pennsylvania 
State College and Consultant to the American Automobile 
Association. The Army examination consisted of a short 
written exercise followed by tests for visual acuity, field of 
vision, depth perception, and the applicant’s reaction to simu- 
lated traffic situations. The Army performance procedure in- 
cludes a closed area and a general road test, with an appraisal 
of the general driving characteristics of the applicant. The 
driver’s performance test developed by the Kansas City Per- 
sonnel Department is based primarily on the closed area pro- 
cedure used by the Army Engineers. 

Five tasks were selected: (1) Driving on a straight line— 
forward and backward, which consisted of lining up the front 
and rear left wheels on a painted 100-foot line, and driving at 
normal speed, to keep both front and rear wheels on the line 
for the entire distance. (2) Gauging space when steering in 
close limits. This is a timed test in which the contestant has 
to make a 90-degree turn going forward, and then to back up, 
making the same turn. (3) Stopping the car smoothly in 40 
feet while going 20 miles an hour. Lines are painted on the 
street as a guide to begin slowing down and to stop. (4) Stopping 
the car with the front wheel exactly on a cross painted on 
the street. This procedure is repeated for the front bumper 
and also for the rear wheel and bumper. The contestant is 
penalized doubly for going beyond the line. (5) Parking the 
car against a curb in a regulation parallel parking. 

A road test through regular traffic conditions includes an 
appraisal of the applicant’s making a right- and a left-hand 
turn, of his starting from a standstill while on an upgrade, and 
finally, of his handling of the car and of his own reactions and 
emotional status in meeting traffic situations. A regular tour- 
ing car was used for this first test, but there is no doubt that 
fire department apparatus would provide a much more dis- 
criminating result. It was found that generally the touring 
car proved easy for everyone to maneuver through the various 
operations. 

One of the best developments of the Neyhart testing pro- 
cedure is the detailed scoring sheet devised for the use of the 
examiner. Under each task are outlined several items upon 
which the contestant is to be graded, e.g.: 


Sample Score Sheet 


III. Stopping smoothly in 40 feet at 20 miles an hour: Front Bumper. 
a. Driving Forward through Stanchions. 
1. Moves gear shift lever to another position without clashing gears. 
2. Keeps an even speed, 20 M.P.H. 
3. Moves vehicle continuously, no stops. 
4. Steers with certainty, no sudden jerks. 
5. Does not stop between stanchions. 
6. Does not hit right or left stanchions. 
7. Does not race engine. 
8. Does not stall engine. 
9. Stops vehicle smoothly. 
10. Stops front bumper short or over line. 


A more accurate and discriminating grade is made possible by 
this system, and the driver may later see the points of the exer- 
cise that he failed to perform correctly. The examiner checks 
those points missed as the operation is being completed and 
scores each exercise before going to the next. It is believed that 
this has proved a much more objective test than others, and by 
the use of the detailed scoring sheet, the standards outlining the 
course, and lines drawn in the street to guide the contestant, 
the result depends less on the judgment of the examiner and 
provides a positive, equal standard for all applicants. It leaves 
little room for the driver to dispute a point, as there can be no 
doubt when he hits or knocks down a standard. 

The equipment used for this test is an important factor, 
both for the applicant and the examiner. Standards painted 
white and topped with red flags are used to outline the differ- 
ent courses, supplemented by four-inch white lines painted on 
the street as a guide to the driver. The tasks for this test may 
all be laid out in an ordinary street. 

The examination for Fire Motor Driver is divided into four 
sections: a written examination, performance test, service rating, 
and rating of experience and training. The written and 
performance sections are both given a valuation of 30%, with 
the service rating and rating of experience and training each 
being valued at 20% of the total. The service rating is also a 
new development and was used for the first time in this exami- 
nation. There has been no further opportunity to determine the 
validity of this new procedure, except for the correlations which 
have been computed. Following are the correlations between 
the various sections of the test: performance and total 
score, +.58; performance and written, +.02; written and total 
score, +.51; performance and service rating, -.17.* The low 
correlation between the performance test scores and written 
scores is regarded as particularly desirable, since the two sec- 
tions are testing for different qualities. 
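
The weighting of the four sections can be illustrated with a brief sketch. The Python below is a minimal example under stated assumptions: the section names are our own labels, each section is assumed to be scored on a common 0 to 100 scale before weighting, and the sample scores are hypothetical rather than drawn from the examination records.

```python
# Section weights as stated in the text: 30%, 30%, 20%, 20%.
WEIGHTS = {
    "written": 0.30,
    "performance": 0.30,
    "service_rating": 0.20,
    "experience_training": 0.20,
}


def total_score(section_scores: dict) -> float:
    """Combine section scores (assumed 0-100 each) into a weighted total."""
    missing = set(WEIGHTS) - set(section_scores)
    if missing:
        raise ValueError(f"missing section scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * section_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical applicant, for illustration only.
    applicant = {
        "written": 80,
        "performance": 90,
        "service_rating": 70,
        "experience_training": 60,
    }
    print(round(total_score(applicant), 1))
```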

The scoring and grading system for the performance test is 
based on a raw score of 100. The test is divided into ten tasks, 
each one having a valuation of 10. Points are deducted for 
each item of the operation that was performed incorrectly. A 
distribution table, based on the scores, is made up, and a pass- 
ing point is set. A conversion table is computed on the basis 
of the 30% valuation set for the performance test. The total 
raw score is then converted and added to the rest of the score 
for the entire examination. 
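
The deduction scoring described above may be sketched as follows. This is an illustration, not the Department's actual procedure: the conversion table itself is not reproduced in the text, so the sketch assumes a simple linear conversion of the raw score to the 30% valuation, and the deductions shown are hypothetical.

```python
TASK_VALUE = 10            # each task is valued at 10 points
NUM_TASKS = 10             # ten tasks give a raw score of 100
PERFORMANCE_WEIGHT = 0.30  # valuation of the performance test in the total


def raw_score(deductions: list) -> int:
    """Return the raw score given the points deducted on each task."""
    if len(deductions) != NUM_TASKS:
        raise ValueError(f"expected deductions for {NUM_TASKS} tasks")
    total = 0
    for lost in deductions:
        if not 0 <= lost <= TASK_VALUE:
            raise ValueError("a deduction cannot exceed the value of the task")
        total += TASK_VALUE - lost
    return total


def converted_score(raw: int) -> float:
    """Assumed linear conversion of a raw score to the 30% valuation."""
    return raw * PERFORMANCE_WEIGHT


if __name__ == "__main__":
    # Hypothetical deductions for one contestant, one entry per task.
    lost_points = [0, 2, 1, 0, 0, 3, 0, 0, 1, 0]
    raw = raw_score(lost_points)
    print(raw, round(converted_score(raw), 1))
```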


* Since this is the first use of the service rating on an examination, we have had 
no measurement of the validity of it. For this reason, the correlation of -.17 should 
not be regarded as seriously as the other correlations. 

The Army Engineers and the Fire Department worked in 
full cooperation with the Personnel Department in developing 
the examination. With some further refinements, it is believed 
that this will constitute a highly valid and successful testing 
procedure. 














MEASUREMENT ABSTRACTS* 


Crider, B. “A Battery of Tests for the Dominant Eye.” Journal of General 
Psychology, XXXI (1944), 179-190. 

In order to meet the need for uniformity of procedures and test materials used 
to determine eye dominance, the author has developed a battery of seven simple tests 
for this purpose. Directions for making the test materials, precise instructions for 
administration, a form for recording performance, and norms based on the records of 
over 700 subjects are given. The author reports that “coefficients of reliability deter- 
mined by os the tests are all over .98 or .99.” Of the population for which 
norms are given, 53 per cent were definitely right-eyed on the battery; 24 per cent 
were definitely left-eyed; 7 per cent impartial-eyed; and the remainder showed left- 
eyed tendency or right-eyed tendency. Edith S. Jay. 





Griffin, C. H. and Borow, H. “An Engineering and Physical Science Aptitude Test.” 

Journal of Applied Psychology, XXVIII (1944), 376-387. 

By selecting portions of the Revised Iowa Physics Aptitude Test, the Moore 
Test of Arithmetic Reasoning, the Bennett Test of Mechanical Comprehension, the 
Moore-Nell Examination, and others, an aptitude test was constructed for use in 
assessing suitability of candidates for training in technical work on the high-school or 
college level. The six parts of the examination can be administered in 72 minutes 
of student working time. Percentile norms have been established separately for 
6695 men and 2295 women enrolled in war training courses at the Pennsylvania State 
College. Multiple correlations of the weighted test parts and course achievement 
were as high as .79. Reliability determined by the split-half method, corrected with 
the Spearman-Brown formula, was .96. The subtests are Mathematics, Formulation, 
Physical Science Comprehension, Arithmetic Reasoning, Verbal Comprehension, and 
Mechanical Comprehension, representing relatively independent aspects of technical 
performance. Edith S. Jay. 





Holzinger, K. J. “The Relationship between the Centroid Method and Spearman’s 

Method.” Journal of Educational Psychology, XXXV (1944), 347-352. 

This article points out the relationship between the centroid method of obtaining 
factor coefficients and Spearman’s 1914 method, as well as the relationship between 
the centroid method and Spearman’s theorem on the correlation of sums. It is shown 
that if communalities are inserted in the diagonals, Spearman’s method for obtaining 
a factor coefficient is the same as the centroid method. Similarly, the method of 
correlation of sums, again with the communalities included, is equivalent to the 
centroid method, it being demonstrated that “centroid coefficients a_j may thus be 
regarded as correlations between a variable z_j and the total (or average) of the 
variables z_1, z_2, z_3 projected on the common-factor space.” L. Bouthilet. 





Kemfer, Homer. “Simplifying the Scoring Technique of the Bernreuter Personality 

Inventory.” Journal of Applied Psychology, XXVIII (1944), 412-413. 

Five simplified keys were used to replace the usual scoring technique of the 
Bernreuter Personality Inventory: I. All original values of 3 to -3 were assigned a 
value of 0; 4 and above became 1; -4 and below became -1. II. Original values of 
2 to -2 became 0; 3 and above became 1; -3 and below became -1. III. “?” 
responses were ignored and all “Yes” and “No” values having a difference of seven 


* Edited by Forrest A. Kingsbury. 
(reduced to five on scale B3-I) or more were paired. Positive values became 1 and 
negative values became -1. IV. “?” responses were ignored and all positive items 
were counted as 1 if there was a difference of six (reduced to five on scale B3-I) or 
more between the “Yes” and “No” answers. V. All positive responses were counted 
as 1. Zero and negatives were ignored. After the new scores were derived, Pearson 
correlations were worked for each scale with results from each key. Seventeen of the 
correlations of the five keys with original percentile ranks on the four scales range 
from .82 to .91. Probable errors range from ±.011 to ±.032. 

It is concluded that the best new key for each trait could be used for quick loca- 
tion of extreme cases, the middle fifty per cent considered “normal,” and the fifty per 
cent at the two extremes rescored with the regular Bernreuter keys. The new keys 
effect a saving of more than 90 per cent in scoring time. Marion H. Groves. 
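
Keys I and II above amount to collapsing each original item weight to 1, 0, or -1 at a chosen cutoff. The following Python sketch shows that rule; the function names and the sample weights are hypothetical, and the sketch does not attempt Keys III to V, which depend on comparing the “Yes” and “No” weights of each item.

```python
def simplify_weight(weight: int, cutoff: int) -> int:
    """Collapse an original item weight to 1, 0, or -1.

    Weights at or above +cutoff become 1, weights at or below -cutoff
    become -1, and everything in between becomes 0.
    Key I corresponds to cutoff=4 and Key II to cutoff=3.
    """
    if weight >= cutoff:
        return 1
    if weight <= -cutoff:
        return -1
    return 0


def simplified_score(weights: list, cutoff: int) -> int:
    """Sum the simplified values of the weights attached to one person's answers."""
    return sum(simplify_weight(w, cutoff) for w in weights)


if __name__ == "__main__":
    # Hypothetical original weights for one respondent's answers on one scale.
    answers = [5, 3, -4, -2, 0, 7, 3]
    print("Key I:", simplified_score(answers, cutoff=4))
    print("Key II:", simplified_score(answers, cutoff=3))
```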





Lundin, R. W. “A Preliminary Report on Some New Tests of Musical Ability.” 

Journal of Applied Psychology, XXVIII (1944), 393-396. 

This describes a battery of 5 tests devised to measure directly and objectively 
musical abilities, learned rather than innate. The tests—recorded musical items on 
which “same” or “different” judgments were requested for interval discrimination, 
melodic transposition, harmonic transposition, melodic sequences and harmonic se- 
quences—were given to 2 groups of subjects: 60 music students from DePauw and 
Indiana Universities and 100 students from undergraduate classes in psychology at 
the latter. The total scores showed a reliability of .73, but the reliabilities of the 
separate tests were not so high. The only criterion for validation was the author’s 
specially constructed graphic rating scale by which the professors of music graded 
their students. Finding a significant difference between the 2 groups, the author 
believes the tests do discriminate between those with and without musical ability. 
Vernon S. Tracht. 





McCarthy, D. “A Study of the Reliability of the Goodenough Drawing Test of 

Intelligence.” Journal of Psychology, XVIII (1944), 201-216. 

The Goodenough Drawing Test of Intelligence was given two times, a week 
apart, to 386 third- and fourth-grade children. Each test was scored three times, 
twice by the same scorer, and once by different scorers. Tests scored by the same 
person gave a correlation of .94, with 12.4 per cent of the cases having a discrepancy 
of one year or more. Correlation between scores by different scorers was .90, but 
discrepancies of one year or more were found in 25.3 per cent of the tests. Moreover, 
although the consistency on two tests taken at different times and scored by 
one person was .68, there was a discrepancy of at least one year in 41.7 per cent of 
the cases. The odd-even reliability computed with the Spearman-Brown prophecy 
formula was .89. The results, demonstrating both the subjectivity of scoring and 
the variability in performance over a short interval, show the need for caution in 
using the scale for individual diagnosis. L. Bouthilet. 
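
For readers unfamiliar with the odd-even procedure mentioned in this abstract, a minimal Python sketch follows. It assumes each child's record is a list of item scores, correlates the odd-item and even-item half scores, and steps the result up with the Spearman-Brown formula r' = 2r / (1 + r). The data are hypothetical and the function names are our own; nothing here reproduces the study's actual computations.

```python
from math import sqrt


def pearson_r(x: list, y: list) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / sqrt(sxx * syy)


def spearman_brown(half_r: float) -> float:
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * half_r / (1 + half_r)


def odd_even_reliability(item_scores: list) -> float:
    """Split each record into odd and even items, correlate, and correct."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


if __name__ == "__main__":
    # Hypothetical pass/fail item scores for four children.
    records = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
    print(round(odd_even_reliability(records), 3))
```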





McClelland, David C. “Simplified Scoring of the Bernreuter Personality Inventory.” 

Journal of Applied Psychology, XXVIII (1944), 414-419. 

The scoring of the Bernreuter Personality Inventory was simplified by assigning 
a value of 1 to all answers weighted 3 or above and a value of —1 to all answers 
weighted —3 or below; answers weighted between these limits were ignored. Corre- 
lations between full and simplified scoring of the inventories of 114 college men on 
five traits were computed; they were above .95 for B1N, B4D, F1C, and F2S, and 
.84 for B2S. Formulas for converting a short score into a full score and a table of 
short score percentile norms for college men are included. The time required to 
short-score an inventory is about 1 minute per trait. Marion H. Groves. 





Mellone, M. A. “A Factorial Study of Picture Tests for Young Children.” British 

Journal of Psychology, XXXV (1944), 

A battery of 14 picture tests, a pcs: Sat test, and a mechanical arithmetic 
test was given to 414 seven-year-old children, 218 boys and 196 girls. The inter- 
correlations between tests were factored by the centroid method into 3 common fac- 
tors and specifics. After rotation, the 3 factors were identified as (1) a general 
factor found in all of the picture tests and slightly in the other tests, (2) a scholastic
factor present in the complete battery but not in the picture-tests battery alone, and 
(3) a space factor, appearing in some of the picture tests, which showed a sex
difference, evident only in the boys. L. Bouthilet.





Odell, C. W. “The Scoring of Continuity or Rearrangement Tests.” Journal of
Educational Psychology, XXXV (1944), 352-356.

Tests in which the task is to rank up to six items in some designated order can 
be scored quickly and in a theoretically sound manner by using a table presented as 
an aid in shortening the calculations. The table is divided into sections which enable 
the scorer to penalize with minus scores for poorer than chance arrangement; to give 
zero scores to all arrangements negatively correlated with the correct order; or to 
score by the better, but longer, method which requires that differences between 
correct ranks and pupil’s responses be squared and summed. Edith S. Jay. 
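
The longer method mentioned above, squaring and summing the differences between correct and assigned ranks, can be sketched as follows. Converting the sum to a score by the usual rank-difference correlation formula is an assumption made here for illustration. Odell's table is not reproduced, and the `allow_negative` switch is a hypothetical stand-in for the choice between minus scores and zero scores.

```python
from typing import Sequence


def sum_squared_rank_differences(correct: Sequence[int],
                                 response: Sequence[int]) -> int:
    """Return the sum of squared differences between paired ranks."""
    n = len(correct)
    expected = list(range(1, n + 1))
    if sorted(correct) != expected or sorted(response) != expected:
        raise ValueError("both sequences must be rankings of 1..n")
    return sum((c - r) ** 2 for c, r in zip(correct, response))


def rank_correlation(correct: Sequence[int], response: Sequence[int]) -> float:
    """Rank-difference correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(correct)
    if n < 2:
        raise ValueError("at least two items are needed")
    d2 = sum_squared_rank_differences(correct, response)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def item_score(correct: Sequence[int], response: Sequence[int],
               allow_negative: bool = True) -> float:
    """Score one rearrangement item.

    With allow_negative=True a poorer-than-chance order gets a minus score;
    with allow_negative=False every negatively correlated order gets zero.
    """
    rho = rank_correlation(correct, response)
    return rho if allow_negative else max(0.0, rho)


if __name__ == "__main__":
    key = [1, 2, 3, 4, 5, 6]
    pupil = [2, 1, 3, 4, 5, 6]  # hypothetical response: first two items swapped
    print(round(item_score(key, pupil), 3))
```

For six items the sum of squared differences runs from 0 (perfect order) to 70 (exactly reversed order), so a lookup table over that range can replace the arithmetic.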





Postwar Counseling on the College Campus. Reflections from the Institute on Student
Personnel Work held at the University of California, Los Angeles, Summer 1944.
Published by Western Personnel Service, Pasadena, California, 1944.


This is a 20-page pamphlet prepared by a committee at the Institute. It aims 
to present the different points of view in a series of short articles. E. G. Williamson 
as “Leader of the Institute” emphasizes the need for a well-planned integrated per- 
sonnel program. The significance of “the contributions of the students themselves 
to their own development” is discussed by Jessie E. Gibson, “Dean of Women.” The 
“Personnel Administrator,” Karl W. Onthank, in reviewing the various papers read 
at the meetings brings out many of the practical considerations that must be met in 
counseling after the war. F. T. Perkins, as “Psychologist,” comments favorably on 
the fact that many diverse agencies are interested in the problems of counseling, but 
suggests that the faculty of colleges should be brought into the personnel program 
to a greater extent. Finally, Ruth P. McLain, representing the “Layman,” was 
impressed by the evidence that not only will colleges be faced with great demands 
in the postwar years, but that the new profession of personnel work will be of much 
value in meeting these demands. L. Bouthilet. 





Rabin, A. I. “Test Constancy and Variation in the Mentally Ill.” Journal of
General Psychology, XXXI (1944), 231-239.

The Wechsler-Bellevue Intelligence Scales were used on 60 New Hampshire State 
Hospital adult patients to determine the relative test-retest consistency of mental 
patients as well as the sensitivity of the measuring instrument. The results revealed 
a correlation of .84 between test and retest and compared favorably with those obtained from
different tests on normal persons, tending to disprove the belief that test results of 
psychotics are very unstable. The coefficient also indicated a high degree of relia- 
bility for the whole scale itself, according to the author’s conclusions. Slight but 
consistent increases in total and individual retest scores were observed, especially on 
the performance part of the scale, the magnitude being directly related to the extent 
of time between test and retest. Vernon S. Tracht. 





Tuckman, Jacob. “A Study of the Reliability of the Minnesota Rate of Manipula- 
tion Test by the Split-Half and Test-Retest Methods.” Journal of Applied 
Psychology, XXVIII (1944), 388-392. 

Since the manual for the Minnesota Rate of Manipulation Test presents no relia- 
bility data the author undertook to determine the reliability of the test. The split- 
half and test-retest techniques were used. Subjects for the split-half study (odd- 
even of the four test trials) were 386 men (17 to 56, M age 21.9), 319 women (17 to 
45, M age 21.8), 145 boys (13.4 to 18.9, M age 15.8), and 111 girls (13.9 to 18.4,
M age 15.6). After applying the Spearman-Brown prophecy formula the corrected
coefficients for both Placing and Turning range from .91 to .97 with no difference 
among the four groups. S.D.’s range from 19.9 to 24.3. An additional 100
high-school students (51 boys and 49 girls) with a mean age of 16.3 were subjects for
both a split-half and a test-retest study. The corrected split-half coefficients for all
possible combinations of single trials range from .82 to .93. The interval between 
the initial test and the retest varied from 1 to 14 days with a median of 7 days. In 
the test-retest study a progressive though small decrease in time for each successive 
trial for both Placing and Turning on the initial and final test is revealed. All
subjects were faster on the retest for both Placing and Turning. The D/σ difference
between total score on initial and final test is 6.7 for Placing and 10.4 for Turning. 
Marion H. Groves. 














NEWS NOTES 


Regional meetings of the Council of Guidance and Personnel Associations will be 
held as follows: 


ATLANTA, GEORGIA 
Dates: Feb. 22, 23, 24, 1945. 
Headquarters: Ansley Hotel. 
Chairman: Dr. W. D. Perry, Director, Military and Vocational Information, 207 
South Building, University of North Carolina, Chapel Hill, N. C. 
Program theme: A Post-War Picture of the South. 


CHICAGO, ILLINOIS 
Dates: Feb. 15, 16, 17, 1945. 
Headquarters: Morrison Hotel. 
Chairman: Mr. E. L. Kerchner, Supervisor, Employment Certificate Office, Board 
of Education, 228 North LaSalle St., Chicago, Ill.
Program theme: Mobilization of Community Counseling Services. 


DENVER, COLORADO
Dates: Feb. 26, 27, 1945.
Headquarters: Albany Hotel.
Chairman: Mr. Dwight C. Baird, Counselor, Occupational Information and Guidance
Service, State Board for Vocational Education, Room 210 State Office Building,
Denver, Colo.
Program theme: Mobilization of Community Counseling Services.


NEW YORK, NEW YORK 
Dates: Feb. 21, 22, 1945. 
Headquarters: Not yet announced. 
Chairman: Dr. Forrest H. Kirkpatrick, Director, Personnel Administration, Radio 
Corporation of America, Camden, N. J. 
Program theme: Mobilization of Community Counseling Services. 


SAN FRANCISCO, CALIFORNIA
Plans incomplete. Announcements will be made later.


The membership of A.C.P.A. will be circularized about further details in the 
near future. 





Lucile Brown, who is with the American Red Cross, is now assigned to recre- 
ational work in rest centers. She writes that one aspect of her job is to plan tours 
to points of historical and cultural interest in southern and central Italy. 





Karl Cowdery, Associate Registrar, Stanford University, died in the early fall. 
Dr. Cowdery had been a member of A.C.P.A. for many years and had served as 
President, as Editor of the Yearbook, and as Chairman of many important com- 
mittees. His death is a great loss to the organization and to his many friends. 





Forrest Kirkpatrick has been appointed consultant on personnel administration 
for the Department of State. 


Grace E. Manson has accepted an appointment in the Civilian Personnel Research
Subsection of the Classification and Replacement Branch of the Adjutant
General’s Office at 270 Madison Avenue, New York, N. Y. 


Harry W. Seamans, formerly General Secretary of the Penn State Christian 
Association, Pennsylvania State College, and recently Community Participation Ad- 
visor, National Housing Agency, Office of the Administrator, Washington, D. C., has 
become Chief of Employee Relations for the O.P.A., Washington, D.C. 


Western Personnel Service has published a report on “Post-war Counseling on 
the College Campus,” which gives reflections from the Institute on Student Personnel 
Work held last summer at the University of California, Los Angeles, in collaboration 
with Western Personnel Service. Contributors are E. G. Williamson, Dean of Stu- 
dents, University of Minnesota; Jessie E. Gibson, President, California Association 
of Deans of Women; Karl W. Onthank, Dean of Personnel Administration, Univer- 
sity of Oregon; F. T. Perkins, Associate Professor of Psychology, Claremont Colleges; 
and Ruth P. McLain, Secretary, Board of Directors, Western Personnel Service. 


Mrs. Chase Going Woodhouse, Connecticut College for Women, was elected to 
Congress in the recent elections. 








